# TurboQuant M5 Max 128GB Stress Test — Large Model Validation

**Tom Turney**
Independent Researcher
GitHub: [@TheTom](https://github.com/TheTom)

---

## Abstract

We stress-test TurboQuant KV cache compression on Meta's Llama-3.1-70B-Instruct and CohereForAI Command-R+ 104B (both Q4_K_M) running on a single Apple M5 Max with 128GB unified memory. These are the largest models tested with TurboQuant to date.

Key findings:

1. **104B at full 128K native context on a laptop.** Command-R+ 104B with turbo3/turbo3 achieves PPL 4.024 at 128K context, using ~74GB of 128GB. 5.12x KV compression.

2. **Larger models tolerate symmetric turbo better.** 104B turbo3/turbo3 is +3.6% PPL (vs +11.4% on 70B). turbo4/turbo4 is +1.9%, nearly lossless.

3. **turbo3 prefill is faster than q8_0 at 32K context.** Consistent on both 70B (80.8 vs 75.2 t/s, +7.4%) and 104B (64.5 vs 62.3 t/s, +3.5%). Smaller KV cache = less memory bandwidth.

4. **NIAH retrieval is perfect.** 30/30 on 70B (q8_0 and turbo3). 10/10 on 104B turbo3 at 4K/8K.

5. **macOS context wall identified and fixed.** The default `iogpu.wired_limit_mb` caps GPU memory at ~75% of RAM. On 128GB, this causes Metal to stall at ~49K context on 70B+ models. Fix: `sudo sysctl iogpu.wired_limit_mb=122880`. One command, no reboot.

All tests used Metal flash attention with full GPU offload. Block size 128 throughout (5.12x compression for turbo3).

---

## 1. Setup

### 1.1 Hardware

| Component | Spec |
|-----------|------|
| SoC | Apple M5 Max |
| Memory | 128GB unified (LPDDR5X) |
| GPU Cores | 40-core |
| Backend | Metal with flash attention |
| OS | macOS |

### 1.2 Model

| Property | Value |
|----------|-------|
| Model | Meta-Llama-3.1-70B-Instruct |
| Weight quantization | Q4_K_M (GGUF, 40GB) |
| Layers | 80 |
| Attention heads | 64 |
| KV heads | 8 (GQA 8:1) |
| Head dimension | 128 |
| Native context | 128K |

### 1.3 Build

- Branch: `feature/turboquant-kv-cache`
- Block size: `QK_TURBO3=128`, `QK_TURBO2=128`
- Sparse V: enabled (default on M5+)
- Boundary V: not active (test predates auto-enable)
- Full GPU offload: `-ngl 99`

---

## 2. Perplexity

### 2.1 Short Context (512 tokens, 20 chunks, wikitext-2-raw)

| K | V | PPL | vs q8_0 | Status |
|---|---|-----|---------|--------|
| q8_0 | q8_0 | 3.257 | baseline | healthy |
| q8_0 | turbo4 | 3.301 | +1.3% | healthy |
| q8_0 | turbo3 | 3.325 | +2.1% | healthy |
| turbo4 | turbo4 | 3.461 | +6.3% | healthy |
| turbo3 | turbo3 | 3.629 | +11.4% | usable |
| turbo2 | turbo2 | 5.161 | +58.5% | degraded |

**Finding:** Llama-70B Q4_K_M tolerates symmetric turbo quantization across all formats. This contrasts with Qwen2.5-7B Q4_K_M, where symmetric turbo3/turbo3 produces catastrophic PPL (3556). The 70B model has sufficient capacity to absorb the quantization stacking that breaks smaller models.

turbo2/turbo2 shows significant degradation (+58.5%) but is not catastrophic — the model remains coherent unlike the Qwen2.5-7B case.

### 2.2 Long Context

| K | V | Context | Chunks | PPL |
|---|---|---------|--------|-----|
| q8_0 | q8_0 | 8K | 4 | 3.617 |
| turbo4 | turbo4 | 8K | 4 | 3.770 |
| turbo3 | turbo3 | 8K | 4 | 3.937 |
| turbo3 | turbo3 | 32K | 2 | 4.839 |
| q8_0 | q8_0 | 48K | 1 | 3.575 |
| turbo3 | turbo3 | 48K | 1 | 4.019 |

PPL remains healthy at all tested context lengths. The turbo3 PPL at 48K (4.019) is higher than q8_0 at 48K (3.575), consistent with the +11.4% gap observed at 512 context.

---

## 3. Speed

### 3.1 Short Context (512 tokens)

| K | V | Prefill (t/s) | Decode (t/s) |
|---|---|:-------------:|:------------:|
| q8_0 | q8_0 | 166.8 | 10.9 |
| q8_0 | turbo4 | 174.4 | 10.1 |
| q8_0 | turbo3 | 174.8 | 10.2 |
| turbo4 | turbo4 | 173.8 | 10.3 |
| turbo3 | turbo3 | 165.0 | 9.9 |
| turbo2 | turbo2 | 170.5 | 9.9 |

Speed is flat across all configs at short context. The 40GB model weights dominate memory bandwidth; the KV cache at 512 tokens is negligible.

### 3.2 8K Context

| K | V | Prefill (t/s) | Decode (t/s) |
|---|---|:-------------:|:------------:|
| q8_0 | q8_0 | 139.2 | 11.9 |
| turbo4 | turbo4 | 134.5 | 10.6 |
| turbo3 | turbo3 | 136.2 | 10.1 |

Still flat. KV cache at 8K is ~1.25GB (q8_0) or ~250MB (turbo3), both negligible vs 40GB weights.

### 3.3 32K Context

| K | V | Prefill (t/s) | Decode (t/s) |
|---|---|:-------------:|:------------:|
| q8_0 | q8_0 | 75.2 | 10.4 |
| turbo4 | turbo4 | 72.5 | 10.5 |
| turbo3 | turbo3 | 80.8 | 10.2 |

**turbo3 prefill is 7.4% faster than q8_0** (80.8 vs 75.2 t/s). At 32K context, the KV cache is large enough (~5GB for q8_0 vs ~1GB for turbo3) that reduced memory bandwidth during attention outweighs dequantization cost. This crossover point is consistent with the observation on smaller models (Qwen3.5-35B MoE on M1 Max, where turbo2 beats q8_0 at 65K prefill).

Decode remains flat at ~10 t/s across all configs. On a 70B model, decode is dominated by the 40GB weight read per token, not the KV cache.

> **Note:** llama-bench with default 5 repetitions hangs on 70B at 32K+ context. All 32K measurements use `-r 1`. Root cause unclear; appears to be Metal resource contention across reps.

---

## 4. Needle-In-A-Haystack (NIAH)

Kamradt single-needle methodology. 5 depths (0%, 25%, 50%, 75%, 100%) × 3 context lengths (4K, 8K, 16K) × 2 cache types.

### q8_0 (baseline)

| Depth | 4K | 8K | 16K |
|-------|:--:|:--:|:---:|
| 0% | PASS | PASS | PASS |
| 25% | PASS | PASS | PASS |
| 50% | PASS | PASS | PASS |
| 75% | PASS | PASS | PASS |
| 100% | PASS | PASS | PASS |

### turbo3 (5.12x compression)

| Depth | 4K | 8K | 16K |
|-------|:--:|:--:|:---:|
| 0% | PASS | PASS | PASS |
| 25% | PASS | PASS | PASS |
| 50% | PASS | PASS | PASS |
| 75% | PASS | PASS | PASS |
| 100% | PASS | PASS | PASS |

**30/30 pass. Zero difference between q8_0 and turbo3.** TurboQuant preserves retrieval accuracy at 5.12x KV cache compression on a 70B model.

---

## 5. Maximum Context

### 5.1 Memory at 48K Context

| Config | KV Cache (MiB) | Model + Context (MiB) | Free (MiB) |
|--------|:--------------:|:---------------------:|:----------:|
| q8_0/q8_0 | 8,160 | 48,991 | 61,108 |
| turbo3/turbo3 | ~3,000 | ~44,000 | ~66,000 |

With turbo3, the KV cache at 48K is 3GB instead of 8GB. Both fit comfortably in 128GB with 60+ GB to spare.

### 5.2 Context Wall (Identified and Fixed)

Without the fix, both q8_0 and turbo3 hang at 50K+ context:

| K | V | Context | Status |
|---|---|---------|--------|
| turbo3 | turbo3 | 48K | works (PPL 4.019) |
| q8_0 | q8_0 | 48K | works (PPL 3.575) |
| turbo3 | turbo3 | 50K | **HANGS** |
| turbo3 | turbo3 | 56K | **HANGS** |
| q8_0 | q8_0 | 64K | **HANGS** |

**Root cause:** macOS sets `recommendedMaxWorkingSetSize` to ~75% of physical RAM by default. On 128GB, this is ~96GB. When total GPU allocations (model + KV + compute buffers) exceed this limit, Metal's buffer allocation blocks indefinitely.

**Fix:** One command, no reboot required:

```bash
# Recommended: 90% of physical RAM (safe for sustained inference)
# Setting above 90% risks kernel panics under sustained load.

# 128GB Mac
sudo sysctl iogpu.wired_limit_mb=117964

# 96GB Mac
sudo sysctl iogpu.wired_limit_mb=88474

# 64GB Mac
sudo sysctl iogpu.wired_limit_mb=58982
```

With the fix applied, 70B runs at 64K context (PPL 4.135). See Section 8 for 104B results at 128K.

Isolation experiments confirmed `iogpu.wired_limit_mb` is the sole fix needed. `GGML_METAL_NO_RESIDENCY=1` and custom ubatch sizes had no measurable effect.

> **Note:** The original stress tests used 122880 (96%) without issues, but community testing (@treblewoe) reports kernel panics at sustained load above 90%. 90% is the recommended safe default.

---

## 6. GQA Impact on Compression Savings

Llama-3.1-70B uses GQA 8:1 (8 KV heads for 64 attention heads). This means the KV cache is already 1/8th the size it would be without GQA. TurboQuant's compression ratios are the same (5.12x for turbo3 at block_size=128), but the absolute memory savings are smaller:

| Config | KV at 48K | vs fp16 KV |
|--------|:---------:|:----------:|
| fp16 | ~15,360 MiB | 1.0x |
| q8_0 | 8,160 MiB | 1.9x |
| turbo3 | ~3,000 MiB | 5.1x |

On models without GQA (e.g., GPT-class with n_kv_heads = n_heads), TurboQuant's savings scale 8× larger in absolute terms.

---

## 7. Comparison with Smaller Models

| Model | Weights | Symmetric turbo3 PPL | Status |
|-------|---------|:--------------------:|--------|
| Qwen2.5-7B | Q4_K_M | 3,556 | catastrophic |
| Qwen2.5-1.5B | Q4_K_M | 8,641 | catastrophic |
| Mistral-24B | Q4_K_M | 4.987 | healthy |
| **Llama-70B** | **Q4_K_M** | **3.629** | **healthy** |
| **Command-R+ 104B** | **Q4_K_M** | **6.415** | **healthy** |

The Q4_K_M symmetric turbo sensitivity appears to be model-family-dependent, not purely size-dependent. Qwen2.5 is sensitive at all sizes. Mistral, Llama, and Command-R+ tolerate it. Larger models show better tolerance (104B +3.6% vs 70B +11.4%). For sensitive models, asymmetric q8_0-K + turbo-V is the recommended path.

---

## 8. Command-R+ 104B Q4_K_M

### 8.1 Model

| Property | Value |
|----------|-------|
| Model | CohereForAI Command-R+ 104B |
| Weight quantization | Q4_K_M (~59GB on disk, 58.4 GiB loaded) |
| Layers | 64 |
| Attention heads | 96 |
| KV heads | 8 (GQA 12:1) |
| Head dimension | 128 |
| Native context | 128K |

Metal configuration: `iogpu.wired_limit_mb=122880` applied before testing.

### 8.2 Perplexity (512 tokens, 20 chunks, wikitext-2-raw)

| K | V | PPL | vs q8_0 | Status |
|---|---|-----|---------|--------|
| q8_0 | q8_0 | 6.192 | baseline | healthy |
| q8_0 | turbo4 | 6.211 | +0.3% | healthy |
| q8_0 | turbo3 | 6.296 | +1.7% | healthy |
| q8_0 | turbo2 | 6.678 | +7.9% | healthy |
| turbo4 | turbo4 | 6.312 | +1.9% | healthy |
| turbo3 | turbo3 | 6.415 | +3.6% | healthy |
| turbo2 | turbo2 | 7.049 | +13.8% | usable |

**104B tolerates symmetric turbo even better than 70B.** turbo3/turbo3 is +3.6% (vs +11.4% on 70B). turbo4/turbo4 is nearly lossless at +1.9%. Bigger models = more headroom for quantization stacking.

### 8.3 Speed

#### 512 Context

| K | V | Prefill (t/s) | Decode (t/s) |
|---|---|:-------------:|:------------:|
| q8_0 | q8_0 | 125.0 | 8.5 |
| q8_0 | turbo4 | 128.5 | 7.6 |
| q8_0 | turbo3 | 129.4 | 7.3 |
| q8_0 | turbo2 | 127.3 | 7.8 |
| turbo4 | turbo4 | 118.0 | 7.9 |
| turbo3 | turbo3 | 125.7 | 6.8 |
| turbo2 | turbo2 | 126.5 | 7.7 |

#### 8K Context

| K | V | Prefill (t/s) | Decode (t/s) |
|---|---|:-------------:|:------------:|
| q8_0 | q8_0 | 101.6 | 7.6 |
| q8_0 | turbo3 | 100.1 | 8.1 |
| turbo3 | turbo3 | 98.4 | 7.6 |

#### 32K Context

| K | V | Prefill (t/s) | Decode (t/s) |
|---|---|:-------------:|:------------:|
| q8_0 | q8_0 | 62.3 | 8.3 |
| turbo3 | turbo3 | 64.5 | 7.8 |

turbo3 prefill faster than q8_0 again at 32K (64.5 vs 62.3 t/s, +3.5%). Same crossover pattern as 70B.

### 8.4 Context Scaling (the big test)

| K | V | Context | PPL | Pass time | Status |
|---|---|---------|-----|-----------|--------|
| turbo3 | turbo3 | 48K | 3.672 | 931s | works |
| turbo3 | turbo3 | 64K | 4.321 | 1481s | works |
| turbo3 | turbo3 | 96K | 4.170 | 2966s | works |
| turbo3 | turbo3 | 128K | 4.024 | 4996s | **works** |

**104 billion parameters. 128K full native context. On a MacBook. PPL 4.024.** turbo3 (5.12x KV compression). Peak memory ~74GB of 128GB. Pass time 83 minutes.

The context wall that blocked 70B at 49K is fully resolved with `iogpu.wired_limit_mb=122880`.

### 8.5 NIAH (Needle-In-A-Haystack)

Kamradt single-needle methodology. 104B decode is slow (~8 t/s), so 16K tests timed out.

#### turbo3 (5.12x compression)

| Depth | 4K | 8K | 16K |
|-------|:--:|:--:|:---:|
| 0% | PASS | PASS | timeout |
| 25% | PASS | PASS | — |
| 50% | PASS | PASS | — |
| 75% | PASS | PASS | — |
| 100% | PASS | PASS | — |

**10/10 pass at 4K and 8K.** 16K timed out due to slow decode, not a retrieval failure.

---

## 9. Limitations

1. **Two models tested.** Results are for Llama-3.1-70B-Instruct and Command-R+ 104B, both Q4_K_M. Other 70B+ models (Qwen-72B, DeepSeek-67B) may behave differently.

2. **Q4_K_M weights only.** Q8_0 weights on 70B would likely show even better turbo PPL, but require ~70GB for weights alone, leaving limited room for KV cache on 128GB.

3. **Metal only.** CUDA and HIP backends were not tested at this model size.

4. **Single run per config.** PPL measurements are single-run, not averaged across multiple seeds. Error bars are from the wikitext-2 chunk variance, not run-to-run variance.

5. **Boundary V not tested.** The auto-enable for turbo2-V was added after this test. Boundary V on 70B turbo2 would likely recover significant quality from the +58.5% degradation.

6. **104B NIAH limited.** 16K context timed out due to slow decode (~8 t/s). Not a retrieval failure. Requires longer query timeout for full coverage.

7. **macOS GPU memory fix required for 70B+ at long context.** The `iogpu.wired_limit_mb` sysctl must be set to ~90% of physical RAM (community testing confirmed >90% risks kernel panics under sustained load). Resets on reboot.

---

## Addendum: Independent Validation (2026-03-31)

The key findings from this stress test have been independently confirmed:

**Asymmetric K/V rescue (Sections 2, 8):**
- **@HyperionMS2040** — RTX 3090, 10-model CUDA sweep (2026-03-30): q8_0/turbo4 "lossless across all tested architectures" (4 architectures validated). Confirmed model-family-dependent sensitivity: Qwen2.5-7B symmetric turbo3 catastrophic (PPL 3,472) vs Llama 3.1 8B symmetric turbo3 +6.4%.
- **@sztlink** — RTX 4090, Qwen3-4B (2026-03-31): Full asymmetric matrix confirms V compression is completely free (1.000 cosine similarity with fp16-K + 2bit-V). All degradation from K.
- **AMD HIP** — RX 9070 XT, gfx1201 (2026-03-29): Asymmetric q8_0/turbo4 confirmed at +1.0% PPL. (Author's own testing on Windows AMD hardware.)

**turbo3 prefill faster than q8_0 at long context (Sections 3, 8):**
- **@spiritbuun** — RTX 3090 (X collaborator, ~2026-03-28): Reported near-parity prefill speed with dequant-then-MMA path on CUDA.
- **@AmesianX** — Blackwell DGX Spark (2026-03-30): turbo decode 63.5 t/s, faster than q8_0 (50.1 t/s) at 8K context.
- **@dusterbloom** — RTX 3090 (2026-03-30): Decode faster on 4/5 models with TBQ3 Flash Attention.
- **@HyperionMS2040** — RTX 3090 (2026-03-30): Asymmetric q8_0-K/turbo3-V 14% faster decode than symmetric turbo3/turbo3 (120.9 vs 106 t/s).

**macOS GPU memory wall (Section 5):**
- **@treblewoe** (X collaborator, 2026-03-31): Confirmed the wall exists. Reported kernel panics at >90% allocation under sustained load (Minimax M2.5 3-bit starting at 100GB). Recommended 90% as safe ceiling. Docs updated accordingly.

---

## References

- TurboQuant paper: [arXiv 2504.19874](https://arxiv.org/abs/2504.19874) (ICLR 2026)
- TurboQuant+ implementation: [github.com/TheTom/turboquant_plus](https://github.com/TheTom/turboquant_plus)
- llama.cpp fork: [github.com/TheTom/llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant)
- Block size optimization: [block-size-experiment.md](block-size-experiment.md)
- Sparse V dequant: [sparse-v-dequant.md](sparse-v-dequant.md)
- Configuration recommendations: [turboquant-recommendations.md](../turboquant-recommendations.md)