# Asymmetric K/V Cache Compression: Why V is Free and K is Everything

**Tom Turney**
Independent Researcher
GitHub: [@TheTom](https://github.com/TheTom)

---

## Abstract

The TurboQuant paper (ICLR 2026) tests KV cache compression with symmetric configurations (same quantization for both K and V). In practice, symmetric turbo quantization is catastrophic on certain model families with low-bit weight quantization (Q4_K_M). We show that this failure is entirely K-side: V compression has zero measurable effect on attention quality when K precision is maintained.

This asymmetry arises from the attention mechanism itself. K determines attention routing via softmax, which amplifies small errors exponentially. V errors scale linearly through the weighted sum. On Qwen2.5-7B Q4_K_M, symmetric turbo3 produces PPL 3,556 (catastrophic), while asymmetric q8_0-K + turbo3-V produces PPL 6.71 (+2.0% vs baseline). Same total compression budget, opposite quality outcomes.

We validate this finding across 7 models (1.5B to 104B), 3 weight quantizations (Q4_K_M, Q6_K, Q8_0), 3 GPU backends (Metal, CUDA, HIP), and 2 Apple Silicon generations (M2 Pro, M5 Max). The practical recommendation is simple: compress V maximally, spend bits on K.

---

## 1. Introduction

KV cache compression reduces memory consumption during LLM inference, enabling longer context windows and larger models on consumer hardware. Google's TurboQuant (ICLR 2026) achieves 4.6-5.12x compression via Walsh-Hadamard rotation and polar quantization. The paper validates quality using fp16 model weights.

In practice, most users run quantized model weights (Q4_K_M, Q6_K) to fit models in limited VRAM. When TurboQuant KV cache compression is applied on top of already-quantized weights, the errors compound. We refer to this as **quantization stacking**: weight quantization error + KV cache quantization error.

The question is: does stacking affect K and V equally?

---

## 2. The Asymmetry: Softmax Amplification

The attention mechanism computes:

$$O = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) \cdot V$$

K determines **which tokens** receive attention weight via softmax. V determines **what information** flows from those tokens. These are fundamentally different roles:

- **K errors** shift the Q*K dot product scores. Softmax amplifies small score differences exponentially: a perturbation of $\epsilon$ in logit space produces an $O(e^\epsilon)$ change in the attention distribution. On sensitive models, this can flip which tokens dominate the output.

- **V errors** scale linearly through the weighted sum. If a V element is off by $\epsilon$, the output error is bounded by $w_i \cdot \epsilon$ where $w_i$ is the attention weight for that position. For non-dominant positions (most of them at long context), $w_i \approx 0$ and the error vanishes.

This predicts that K precision should matter far more than V precision for output quality. Our experiments confirm this.

---

## 3. Results

All PPL measured on wikitext-2-raw (512 context, 20 chunks) unless noted. All tests on Metal with flash attention, full GPU offload.

### 3.1 The Catastrophic Case: Qwen2.5-7B Q4_K_M

This model demonstrates the asymmetry most dramatically.

| K | V | M5 Max PPL | M2 Pro PPL | vs q8_0 | Status |
|---|---|-----------|-----------|---------|--------|
| q8_0 | q8_0 | 6.577 | 6.579 | baseline | healthy |
| q8_0 | turbo4 | 6.644 | 6.603 | +1.0% | healthy |
| q8_0 | turbo3 | 6.707 | 6.715 | +2.0% | healthy |
| q8_0 | turbo2 | 6.911 | — | +5.1% | healthy |
| turbo4 | turbo4 | 217.7 | 227.5 | catastrophic | avoid |
| turbo3 | turbo3 | 3,556 | 3,778 | catastrophic | avoid |

**V compression is essentially free**: q8_0-K + turbo2-V (2-bit V, 6.4x V compression) gives +5.1% PPL. The model is fully coherent.

**K compression is catastrophic**: turbo4-K + turbo4-V (same bits on both) gives PPL 218. The model is incoherent.

**Same total bits, opposite results.** The difference is 500x in PPL depending on which cache you compress.

Cross-hardware matched: M2 Pro and M5 Max produce equivalent quality patterns.

### 3.2 Qwen2.5-1.5B Q4_K_M

| K | V | PPL | vs q8_0 | Status |
|---|---|-----|---------|--------|
| turbo3 | turbo3 | 8,641+ | catastrophic | avoid |

Symmetric turbo is catastrophic on Qwen2.5 at all sizes, not just 7B. The sensitivity is architecture-dependent.

### 3.3 phi-4-14B (Q8_0 Weights)

| K | V | M5 Max PPL | vs q8_0 | Status |
|---|---|-----------|---------|--------|
| q8_0 | q8_0 | 4.690 | baseline | healthy |
| q8_0 | turbo4 | 4.702 | +0.3% | healthy |
| q8_0 | turbo3 | 4.742 | +1.1% | healthy |
| q8_0 | turbo2 | 4.835 | +3.1% | healthy |
| turbo4 | turbo4 | 4.770 | +1.7% | healthy |
| turbo3 | turbo3 | 4.886 | +4.2% | healthy |

With Q8_0 weights, both symmetric and asymmetric work. The quantization stacking effect only appears when weight quantization is already aggressive (Q4_K_M). V compression gradient is clear: turbo4-V +0.3%, turbo3-V +1.1%, turbo2-V +3.1%. All healthy.

### 3.4 Mistral-Small-24B Q4_K_M

| K | V | PPL | vs q8_0 | Status |
|---|---|-----|---------|--------|
| turbo3 | turbo3 | 4.987 | healthy | healthy |

Q4_K_M symmetric turbo is not universally broken. Mistral-24B handles it fine. The sensitivity is model-family-dependent, not purely a function of weight quantization.

### 3.5 Llama-3.1-70B Q4_K_M

| K | V | PPL | vs q8_0 (3.257) | Status |
|---|---|-----|---------|--------|
| q8_0 | turbo4 | 3.301 | +1.3% | healthy |
| q8_0 | turbo3 | 3.325 | +2.1% | healthy |
| q8_0 | turbo2 | 3.568 | +9.5% | healthy |
| turbo4 | turbo4 | 3.461 | +6.3% | healthy |
| turbo3 | turbo3 | 3.629 | +11.4% | usable |
| turbo2 | turbo2 | 5.161 | +58.5% | degraded |

**Asymmetric rescue**: q8_0/turbo2 = +9.5% vs symmetric turbo2/turbo2 = +58.5%. 6x improvement in quality degradation from the same V format, just by preserving K precision.

Llama-70B tolerates symmetric turbo3 (+11.4%), but asymmetric q8_0/turbo3 is better (+2.1%). Larger models absorb quantization stacking better than smaller ones.

### 3.6 Command-R+ 104B Q4_K_M

| K | V | PPL | vs q8_0 (6.192) | Status |
|---|---|-----|---------|--------|
| q8_0 | turbo4 | 6.211 | +0.3% | healthy |
| q8_0 | turbo3 | 6.296 | +1.7% | healthy |
| q8_0 | turbo2 | 6.678 | +7.9% | healthy |
| turbo4 | turbo4 | 6.312 | +1.9% | healthy |
| turbo3 | turbo3 | 6.415 | +3.6% | healthy |
| turbo2 | turbo2 | 7.049 | +13.8% | usable |

104B tolerates symmetric turbo even better than 70B. turbo4/turbo4 is +1.9%, nearly lossless. The trend is clear: bigger models have more capacity to absorb quantization stacking.

### 3.7 AMD HIP Validation (RX 9070 XT)

| K | V | PPL | vs q8_0 (7.794) | Status |
|---|---|-----|---------|--------|
| q8_0 | turbo4 | 7.876 | +1.0% | healthy |
| turbo4 | turbo4 | 401.4 | catastrophic | avoid |
| turbo3 | turbo3 | 81,277 | catastrophic | avoid |

Asymmetric q8_0/turbo4 confirmed on AMD. Symmetric Q4_K_M failure is consistent across all three GPU vendors (Metal, CUDA, HIP).

---

## 4. The Quantization Stacking Model

Why do some models survive symmetric turbo on Q4_K_M while others don't?

The hypothesis: K cache quantization error compounds multiplicatively with weight quantization error in the attention logit computation. The Q*K dot product involves both quantized-weight Q projections and quantized-cache K values. When both are lossy, the logit errors can exceed softmax's tolerance threshold.

The evidence supports this:

| Model | Weights | Symmetric turbo3 PPL | Status |
|-------|---------|:--------------------:|--------|
| Qwen2.5-1.5B | Q4_K_M | 8,641 | catastrophic |
| Qwen2.5-7B | Q4_K_M | 3,556 | catastrophic |
| Mistral-24B | Q4_K_M | 4.987 | healthy |
| Llama-70B | Q4_K_M | 3.629 | healthy |
| Command-R+ 104B | Q4_K_M | 6.415 | healthy |

The sensitivity is model-family-dependent. Qwen2.5 is sensitive at all sizes. Llama, Mistral, and Cohere are tolerant. Within tolerant families, larger models absorb stacking better (104B: +3.6% vs 70B: +11.4%).

V compression adds negligible error regardless of weight quantization. This is because V errors don't pass through softmax and don't interact multiplicatively with weight quantization in the same way.

---

## 5. Practical Recommendations

### For any model you haven't tested

```bash
# Safe default: asymmetric, full K precision, compressed V
llama-server -m model.gguf -ctk q8_0 -ctv turbo4 -fa on
```

### For models validated as symmetric-tolerant

```bash
# Symmetric: more compression, validated on Llama/Mistral/Cohere 24B+
llama-server -m model.gguf -ctk turbo3 -ctv turbo3 -fa on
```

### For maximum V compression

```bash
# Asymmetric with turbo2-V (Boundary V auto-enabled)
llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa on
```

### Decision tree

1. Q8_0+ weights? Symmetric turbo works. Use `-ctk turbo4 -ctv turbo4` or `-ctk turbo3 -ctv turbo3`.
2. Q4_K_M weights, unknown model? Start asymmetric: `-ctk q8_0 -ctv turbo4`.
3. Q4_K_M weights, Qwen2.5? Must use asymmetric. Symmetric is catastrophic.
4. Q4_K_M weights, Llama/Mistral/Cohere 24B+? Symmetric likely works, but validate PPL first.

---

## 6. Compression Analysis

Asymmetric K/V trades some K-side compression for quality safety. The V-side savings still dominate for long context:

| Config | K bpv | V bpv | KV compression vs fp16 | Quality |
|--------|-------|-------|:----------------------:|---------|
| q8_0/q8_0 | 8.5 | 8.5 | 1.9x | baseline |
| q8_0/turbo4 | 8.5 | 4.25 | 2.5x | +0.3-1.3% |
| q8_0/turbo3 | 8.5 | 3.125 | 2.8x | +1.1-2.1% |
| q8_0/turbo2 | 8.5 | 2.125 | 3.0x | +3.1-9.5% |
| turbo3/turbo3 | 3.125 | 3.125 | 5.1x | model-dependent |
| turbo4/turbo4 | 4.25 | 4.25 | 3.8x | +1.7-6.3% |

Even asymmetric q8_0/turbo3 achieves 2.8x total KV compression. At 128K context on a 70B model, that saves ~5GB of KV cache memory.

---

## 7. Independent Validation

These findings have been independently confirmed by multiple researchers:

**@sztlink (Felipe Sztutman)** — Qwen3-4B, RTX 4090, tonbistudio/turboquant-pytorch (2026-03-31):
- fp16-K + 2bit-V: 1.000 cosine similarity, 100% top-1 match
- All degradation from K compression, V has zero effect
- Boundary layer gap scales with K compression (-0.001 at 4-bit K, -0.010 at 2-bit K)
- [Layer 0 K isolation test](https://github.com/ggml-org/llama.cpp/discussions/20969#discussioncomment-16402887): protecting boundary K layers (fp16) did NOT reduce the gap — gap grew more negative, meaning boundary layers already compress *better* than middle layers. Mechanistic explanation: extreme K norms (146.8 at layer 0) produce tightly Gaussian distributions after normalization, which are ideal for Lloyd-Max codebooks. Middle layers (K norm 20–40) have more variable distributions and compress slightly less accurately. Confirms uniform K compression is stable — no per-layer K precision tuning needed

**@HyperionMS2040** — 10-model CUDA sweep, RTX 3090 (2026-03-30):
- Qwen2.5-7B symmetric turbo3: catastrophic (PPL 3,472). Llama 3.1 8B symmetric turbo3: +6.4%
- q8_0/turbo4 "lossless across all tested architectures" (4 architectures)
- Confirms model-family-dependent sensitivity pattern

**AMD HIP validation** — RX 9070 XT, gfx1201 (2026-03-29):
- Asymmetric q8_0/turbo4 confirmed: +1.0% PPL
- Symmetric catastrophic: PPL 81,277 (turbo3/turbo3 on Qwen2.5-7B Q4_K_M)
- Consistent with Metal and CUDA findings
- (Author's own testing on Windows AMD hardware)

**@jesusmb1995** — Vulkan backend, Mistral-7B Q4_K_S (2026-03-31):
- Mixed K/V Vulkan results support asymmetric thesis: q8_0-K + pq3_0-V = PPL 6.9426 (+0.64%), f16-K + pq4_0-V = PPL 6.8901 (-0.12%)
- V compression nearly free when K precision maintained, consistent across a fourth GPU backend (Vulkan)
- Note: uses `pq`/`tbq` type naming (jesusmb1995's Vulkan implementation), same underlying WHT + PolarQuant algorithm

**@stragulus** — Vulkan, AMD Radeon 7900 XTX (2026-03-31):
- Independent confirmation on jesusmb1995's Vulkan branch: q8_0-K + pq3_0-V = PPL 6.9226 (+0.37%), f16-K + pq4_0-V = PPL 6.8837 (-0.19%)
- V compression free on fifth hardware/backend combination (Vulkan + AMD discrete)

**@sztlink (Felipe Sztutman)** — [Qwen3-30B-A3B Q4_K_M, RTX 4090, AmesianX v1.2.0](https://github.com/ggml-org/llama.cpp/discussions/20969#discussioncomment-16403244) (2026-04-01):
- Production PPL validation (wikitext-2-raw, ctx=512, 20 chunks). First SM89 (Ada) data for this model
- q8_0/tbq3 (asymmetric): PPL 7.5910 (+0.57% vs f16 7.5477) — tightest asymmetric result to date, confirms "V is free" on MoE
- tbq3/tbq3 (symmetric): PPL 9.5221 (+26.16%) — catastrophic. Extends Qwen symmetric sensitivity from Qwen2.5 dense to **Qwen3 MoE**
- tbq4/tbq4 (symmetric): PPL 7.9950 (+5.93%) — healthy, matching q4_0. The turbo4-K catastrophic failures reported by zekrom-vale were caused by kernel bugs in TheTom's fork (documented in turbo4-resurrection paper) and head_dim misdetection in AmesianX's fork (fixed in v1.3.0), not an inherent turbo4 or IQ4_XS issue. AmesianX independently confirmed turbo4-K works correctly: PPL 6.73 (+7.5%) on Qwen3-30B-A3B and 6.20 (+0.6%) on Qwen3.5-27B distill
- Adds Qwen3 MoE to Section 4 sensitivity table: Qwen family is sensitive across architectures (dense and MoE), not just Qwen2.5

**@Madreag** — [Optimized CUDA fork](https://github.com/Madreag/turbo3-cuda/tree/release/cuda-optimized), 4 GPUs (SM86x2/SM89/SM120), 1,351+ iterations (2026-04-01):
- Full asymmetric K/V PPL matrix (ctx=512, Qwen3.5-27B Q6_K): **V type dominates PPL** (columns vary more than rows). K=turbo3/V=turbo3 (6.7251) approximately equals K=q8_0/V=turbo3 (6.6885). Independent CUDA confirmation that K precision matters less than V precision for quality, and that asymmetric is not needed on Q6_K+ weights
- [Detailed results](https://github.com/ggml-org/llama.cpp/discussions/20969#discussioncomment-16404751) with KL divergence, NIAH, and speed across 4 GPUs and 3 architecture generations

**@sjoerdmaessen (Sjoerd Maessen)** — [Qwen3.5-122B-A10B Q5_K_S, 2x NVIDIA L40S 48GB (SM89)](https://github.com/ggml-org/llama.cpp/discussions/20969#discussioncomment-16412852) (2026-04-01/02):
- **CORRECTION (2026-04-02):** Asymmetric q8_0-K / turbo3-V produces **corrupt output** — literal U+003F (`?`) characters. Speed measurements were accurate (61.1 t/s) but content is garbage. [Discovered in production](https://github.com/ggml-org/llama.cpp/discussions/20969#discussioncomment-16418601) when checking actual chatbot responses (Charles, Matrix bot). Speed-only benchmarks masked the issue
- Symmetric turbo3/turbo3 **works correctly**: 58 t/s, verified with factual queries, counting, reasoning in Dutch. Switching from `-ctk q8_0 -ctv turbo3` to `-ctk turbo3 -ctv turbo3` immediately produces coherent output
- Asymmetric q8_0/turbo2 **likely same issue** (not content-verified)
- **Root cause under investigation.** Likely related to V operations incorrectly gated on `k->type` (see issue #42). Three instances identified in `llama-graph.cpp` where V unpad checks `k->type` instead of `v->type`. Fix exists on branch `fix/turbo-v-unpad-gate` but was not merged to main. Note: Qwen3.5-122B has head_dim=128 (no padding needed), so the unpad bug alone may not explain this — additional asymmetric code path bugs may exist
- Speed measurements from initial testing remain accurate: asymmetric 61.1 t/s (100% of q8_0), symmetric 58 t/s (-6.4%). The K-side dequant speed penalty is real
- **Updated production config:** turbo3/turbo3 symmetric, 2x104K dual-slot (reduced from 2x128K due to larger compute buffer with symmetric K), parallel 2, `MTMD_BACKEND_DEVICE=CUDA1` for vision
- First 122B model tested, first L40S (data center Ada) validation, first multi-GPU split validation
- `MTMD_BACKEND_DEVICE=CUDA1` discovery: moving mmproj to GPU1 freed GPU0 VRAM, enabling dual-slot + vision. Valuable independent of K/V config

**@mudler (Ettore Di Giacinto)** — [APEX launch](https://github.com/mudler/apex-quant), [LocalAI](https://github.com/mudler/LocalAI) (44.7k stars) (2026-04-01):
- Released APEX (Adaptive Precision for EXpert Models), a MoE-aware mixed-precision weight quantization method for Qwen3.5-35B-A3B
- Explicitly recommends pairing APEX with TurboQuant KV cache compression in the [launch announcement](https://huggingface.co/mudler/Qwen3.5-35B-A3B-APEX-GGUF). First major open source project to officially integrate TurboQuant into their recommended stack
- APEX + TurboQuant at 8K context: +14% prompt processing speedup across all APEX tiers, zero quality loss, 4.6x KV cache compression
- APEX Mini (12.2 GB) + TurboQuant = 35B MoE at 8K context on a 16GB consumer GPU
- Independently confirms boundary layer sensitivity for weights: "first and last 5 transformer layers are 10x more sensitive than the middle," consistent with our Boundary V finding for KV cache

---

## 8. Limitations

1. **Sensitivity root cause not proven.** We observe that Qwen2.5 is sensitive and Llama/Mistral are not, but we have not identified the specific architectural feature that causes this. Possible factors: attention head count, KV head ratio, weight distribution, or training procedure.

2. **PPL is a proxy.** Perplexity on wikitext-2 at 512 context is a noisy single metric. Task-specific benchmarks (reasoning, code, retrieval) may show different sensitivity patterns.

3. **Limited model families.** Tested on Qwen2.5, Llama-3.1, Mistral, Command-R+, phi-4, and Qwen3.5. Other architectures (DeepSeek, Gemma, GPT-class) are untested.

4. **Metal primary, CUDA/HIP secondary.** Most experiments run on Apple Silicon Metal. CUDA and HIP results confirm the pattern but with fewer model/config combinations tested.

---

## 9. Conclusion

V cache compression is effectively free. K cache precision is the dominant lever for attention quality. This is a direct consequence of the attention mechanism: K errors are amplified exponentially through softmax, while V errors scale linearly through the weighted sum.

The practical implication is simple: use asymmetric K/V configs. Compress V aggressively (turbo2 or turbo3), keep K at q8_0 or higher. This gives meaningful compression (2.5-3.0x total KV) with minimal quality loss (+0.3-2.1% PPL) on all tested models, including those where symmetric turbo fails catastrophically.

For models validated as symmetric-tolerant (Llama, Mistral, Cohere at 24B+), symmetric turbo3/turbo3 offers better compression (5.1x) at acceptable quality (+3.6-11.4% PPL). But asymmetric is always the safe default.

---

## 10. Prior Discussion

The asymmetric K/V findings were first shared publicly in the [llama.cpp TurboQuant discussion thread](https://github.com/ggml-org/llama.cpp/discussions/20969) starting March 26, 2026. Initial recommendations for asymmetric configs and the observation that K precision dominates quality were posted there before being formalized in this paper.

---

## References

- TurboQuant paper: [arXiv 2504.19874](https://arxiv.org/abs/2504.19874) (ICLR 2026)
- TurboQuant+ implementation: [github.com/TheTom/turboquant_plus](https://github.com/TheTom/turboquant_plus)
- llama.cpp fork: [github.com/TheTom/llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant)
- Sparse V dequantization: [sparse-v-dequant.md](sparse-v-dequant.md)
- Boundary V: [layer-aware-v-compression.md](layer-aware-v-compression.md)
- M5 Max stress test: [m5-max-stress-test.md](m5-max-stress-test.md)
- Configuration recommendations: [turboquant-recommendations.md](../turboquant-recommendations.md)