# TurboQuant Decode Speed: Why Some Hardware Struggles

## The Problem

turbo3 decode speed varies dramatically across Apple Silicon generations:

| Hardware | Decode vs q8_0 (short) | Decode vs q8_0 (16K) | Has Tensor API |
|----------|----------------------|---------------------|---------------|
| M5 Max | 0.89x | ~0.82x | yes |
| M2 Pro | 0.73x | 0.67x (with 4-mag fix) | no |
| M1 Max | 0.84x | 0.45x (reported) | no |

M5 Max has near-parity. M2/M1 fall off a cliff at long context. This doc explains why and what we did about it.

## Root Cause: Constant Memory LUT Divergence

turbo3 dequant uses an 8-entry centroid lookup table (LUT) in Metal `constant` memory. Each element's value is determined by a 3-bit index (2-bit qs + 1-bit sign), so each thread in a 32-thread simdgroup may access a different LUT entry.

This "divergent access" pattern is where hardware generations differ:

- **M5 Max (Apple10)**: Efficient constant cache handles 8-way divergent access with minimal penalty. LUT costs ~14% of decode time.
- **M2 Pro (Apple8)**: Older constant cache serializes divergent access. LUT costs ~25% of decode time — **2x worse than M5**.
- **M1 Max (Apple7)**: Even worse, likely 30%+ LUT cost based on reported numbers.

## How We Know: Profiling Isolation

We added `TURBO_PROFILE_MODE` (0-4) to strip away dequant layers one at a time:

| Mode | What | M5 (% of ceiling) | M2 (% of ceiling) |
|------|------|-------------------|-------------------|
| 1 | No dequant at all | 78.9 (100%) | 24.5 (100%) |
| 2 | + read norm only | 75.1 (95%) | 22.1 (90%) |
| 4 | + read all bytes | 75.2 (95%) | 21.9 (89%) |
| 3 | + qs extraction + LUT | 64.9 (82%) | 16.4 (67%) |
| 0 | + signs + full LUT | 59.2 (75%) | 14.0 (57%) |

**Key finding: No-dequant turbo3 is 12% FASTER than q8_0 on M2 Pro** (24.5 vs 21.9) because the compressed cache moves less data over the memory bus. The compression FORMAT is not the problem. The dequant COMPUTATION is.

**The LUT accounts for:**
- M5: 13.7% of ceiling (Mode 4 → Mode 3)
- M2: 25.1% of ceiling — **2x worse**

Reading the bytes (norm + qs + signs) costs ~10% on both chips. That's not the bottleneck.

## What We Tried (7 Approaches)

| # | Approach | M2 8K tok/s | vs Main | Why it works/doesn't |
|---|----------|-------------|---------|---------------------|
| 1 | **4-mag LUT + XOR sign** | **15.1** | **+38%** | **Halves constant addresses (4 vs 8). Sweet spot on M2.** |
| 2 | Batched byte extract | 13.7 | +25% | Better byte reading, still 8 LUT addresses |
| 3 | Deferred norm multiply | 12.9 | +18% | Loses ILP — per-element norm hides LUT latency |
| 4 | 2-pair half2 LUT | 12.0 | +10% | Ternary overhead exceeds savings from 2 addresses |
| 5 | Select chain (zero LUT) | 11.9 | +9% | Too much ALU — branches on M2 GPU |
| 6 | Bit-arithmetic | 11.6 | +6% | Pure ALU, zero memory — but ALU cost too high |
| 7 | Non-vec FA (nl=2) | 10.2 | -7% | Non-vec kernel not designed for single-token decode |

Also tested on M5 Max (no help needed):
- float cn[8] registers: 75.2 (spills to stack on Metal)
- half cn[8] registers: 74.4 (also spills)
- Split 2×4 half LUT: 74.0 (branch overhead)

## The Fix: Auto-Detection

The build auto-detects hardware at Metal library compile time:

```
M1/M2/M3/M4 (has_tensor=false) → TURBO_USE_4MAG=1 → 4-entry magnitude LUT
M5+          (has_tensor=true)  → TURBO_USE_4MAG=0 → 8-entry full LUT
```

Each chip gets its optimal dequant path. No user configuration needed.

### How 4-mag LUT works

The 8 centroids have structure: 4 magnitudes × 2 signs. We split the 3-bit index:
- **2-bit qs** → selects magnitude from 4-entry LUT (4 possible constant addresses)
- **1-bit sign** → reverses magnitude index via XOR, then negates via ALU multiply
- **Norm** → multiplied per-element (provides ALU work that hides constant memory latency)

The XOR trick: for negative values (sign=0), the magnitude index is reversed (qs=0 → highest magnitude). `qs ^ 0x3` flips the 2-bit index without a branch.

### Critical finding: On M2, branches cost MORE than divergent constant reads

We tested 9 approaches total. The pattern is clear:

| Approach | M2 8K | Constant reads | Branches | Result |
|----------|-------|---------------|----------|--------|
| **4-mag LUT** | **15.1** | **4 divergent** | **0** | **BEST** |
| Batched extract (8-LUT) | 13.7 | 8 divergent | 0 | Good |
| Deferred norm (4-mag) | 12.9 | 4 divergent | 0 | Lose ILP |
| 2-pair half2 + ternary | 12.0 | 2 divergent | 4 | Branches hurt |
| Select chain (zero LUT) | 11.9 | 0 | 8 | Branches kill |
| Bit-arithmetic | 11.6 | 0 | 4+ | ALU too heavy |
| Named-reg + ternary select | 10.3 | 4 uniform | 8 | Worst — ternary kills |
| Main (8-entry LUT) | 10.95 | 8 divergent | 0 | Baseline |
| Inline block (FA inner loop) | 13.5 | 4 divergent | 0 | I-cache pressure |
| Non-vec FA (nl=2) | 10.2 | 8 divergent | 0 | Wrong kernel for decode |

**Any approach that replaces constant reads with branches loses on M2.** The Apple8 GPU's branch predictor/execution is worse than its constant cache. The 4-mag LUT succeeds because it reduces constant addresses (4 vs 8) WITHOUT adding branches.

### Per-element norm multiply is faster than deferred

Deferring `float4 * norm` at the end (12.9 tok/s) is slower than per-element `v * norm` (15.1 tok/s). The per-element multiply provides ALU work that hides constant memory latency via instruction-level parallelism. The GPU can overlap the next constant read while the current multiply executes.

## Why CUDA Doesn't Have This Problem

@spiritbuun's CUDA fork achieves 97.5% of q8_0 decode. The key difference:

- **CUDA has 255 registers per thread** — cn[8] fits in registers without spilling
- **Metal has a smaller register file** — cn[8] spills to thread stack memory
- **CUDA warp semantics** — better divergent access handling across 32 threads
- **Metal simdgroup semantics** — constant cache serializes more on older chips

The register LUT approach works perfectly on CUDA but is fundamentally incompatible with Metal's register file on current hardware.

## Remaining Gap

Even with 4-mag LUT, M2 Pro decode at 8K is 15.1 vs 24.5 ceiling (62%). The remaining 38% gap needs kernel-level changes:

1. **SMEM pre-dequant**: Pre-dequant entire KV blocks into threadgroup memory before dot products. Would eliminate constant memory from the inner loop entirely. Requires modifying the FA vec kernel template.

2. **Per-group device-memory cached centroid×norm**: Store 8 pre-computed centroid×norm values per 128-element group alongside the block data. Dequant reads from device memory (sequential) instead of constant memory (divergent). Changes the block format.

3. **Fused Q·centroid attention**: Compute attention scores directly from quantized indices without full dequantization. Precompute Q·centroid table (8 values per block), then each K element is a table lookup. Complex (custom FA kernel).

These are tracked in GitHub issue #39.

## Summary

| What | Result |
|------|--------|
| Root cause | Constant memory LUT divergence, 2x worse on M2 vs M5 |
| Best fix found | 4-mag LUT (+38-45% on M2, auto-detected) |
| M5 impact | Zero regression (uses separate code path) |
| Remaining gap | 38% to ceiling, needs kernel-level surgery |
| CUDA comparison | buun gets 97.5% via register LUT (not portable to Metal) |

## Complete Experiment Log (12 approaches, 2026-03-26/27)

### M2 Pro (Apple8, has_tensor=false) — 8K context decode

| # | Approach | tok/s | vs q8_0 | vs Ceiling | Key finding |
|---|----------|-------|---------|-----------|-------------|
| — | No-op ceiling | 24.5 | 1.12x | 100% | turbo3 format FASTER than q8_0 (less bandwidth) |
| **1** | **4-mag LUT + XOR sign** | **15.1** | **0.69x** | **62%** | **Sweet spot: 4 constant addrs, 0 branches** |
| 2 | Batched byte extract (8-LUT) | 13.7 | 0.63x | 56% | Better byte reading, still 8 addresses |
| 3 | Inline block in FA loop | 13.5 | 0.62x | 55% | I-cache pressure from expanded inline |
| 4 | Deferred norm (float4 * norm) | 12.9 | 0.59x | 53% | Loses ILP — norm multiply hides LUT latency |
| 5 | 2-pair half2 + ternary | 12.0 | 0.55x | 49% | Ternary overhead exceeds 2-addr savings |
| 6 | Select chain (zero LUT) | 11.9 | 0.54x | 49% | 8 ternaries = 8 branches on Apple8 |
| 7 | Bit-arithmetic (mul+add) | 11.6 | 0.53x | 47% | 7 ALU > 4 constant reads |
| 8 | FMA branchless (zero ternary) | 11.4 | 0.52x | 47% | fma doesn't help — same ALU count |
| — | Main (8-entry constant LUT) | 10.95 | 0.50x | 45% | Baseline — 8-way divergent |
| 9 | Named-reg ternary select | 10.3 | 0.47x | 42% | 4 uniform reads + 8 branches = worst |
| 10 | Non-vec FA forced (nl=2) | 10.2 | 0.47x | 42% | Non-vec kernel wrong for single-token decode |

### M5 Max (Apple10, has_tensor=true) — short context decode

All approaches: 75-77 tok/s (M5 uses 8-LUT path, unaffected by TURBO_USE_4MAG).
No regression from any experiment. q8_0 baseline: 85 tok/s.

### Additional checks
- **Qwen2 attention bias**: NOT present in Qwen2.5-7B-Instruct model (339 tensors, no attn_q/k/v.bias). The GÖKYILDIZ bug does not apply.
- **Model-independence**: Profiling pattern (LUT = 25% cost on M2) is consistent across Qwen2.5-7B (M2 Pro) and Qwen3.5-35B-A3B (M5 Max). Architecture-independent.
- **Non-vec FA (nl=2)**: Faster on M5 (+1.7%) but much worse on M2 (-7%). The non-vec kernel processes batch=1 inefficiently at long context.

### Hardware truth (Apple8/M2 Pro)
1. **1 divergent constant read < 7 ALU ops** — even with fma()
2. **Metal compiles ternaries to branches** — not predicated moves like CUDA
3. **Branches cost MORE than divergent constant reads** — the opposite of CUDA
4. **Array indexing ALWAYS spills** — Metal's register file is too small for cn[4+]
5. **4 constant addresses is the sweet spot** — fewer adds branches, more adds thrashing
6. **Per-element norm multiply provides ILP** — hides constant memory latency

### Papers reviewed for novel approaches
- AttentionPack (arxiv 2603.23914): validates kernel fusion for attention-aware decompression
- GlowQ (arxiv 2603.25385): validates group-shared factor caching (+37.4% throughput)
- Embedding Compression via Spherical Coordinates (2602.00079): same PolarQuant framework, different encoding
- MKA: Memory-Keyed Attention (2603.20587): learned memory lookup, hardware-aware
- Scaling Attention via Feature Sparsity (2603.22300): skip negligible attention weights

### Next frontier
The 4-mag LUT is the dequant-level ceiling. The remaining 38% gap requires:
1. **Block format change**: embed precomputed centroid×norm in device memory (sequential reads, zero divergence)
2. **Custom FA kernel**: fuse dequant into attention with restructured computation
3. **Different quantization scheme**: format designed for Metal's constraints from scratch

## Extended Experiments (approaches 12-13, 2026-03-27)

| # | Approach | M2 8K tok/s | vs 4-mag | Key finding |
|---|----------|-------------|---------|-------------|
| 12 | FMA branchless (zero ternary, zero memory) | 11.4 | -24% | 7 ALU ops slower than 4 constant reads even with fma() |
| 13 | simd_shuffle cross-lane magnitude select | 14.7 | -2.6% | Shuffle latency ≈ constant read latency on Apple8 |

### FMA branchless details
Fully branchless: XOR mask via `3 - 3*sign_bit` (not ternary), sign via `2*s - 1` (not ternary), magnitude via 3-chained `fma()`. Zero branches, zero memory. Still slower because 7 ALU cycles > 1 divergent constant read cycle on Apple8.

### simd_shuffle details
Threads 0-3 within each 8-thread block compute mag[i]×norm. All threads use `simd_shuffle(value, block_base + mi)` to read the correct mag×norm. This IS branchless and memory-free — the value moves via cross-lane register transfer. But shuffle latency on Apple8 is comparable to constant cache access, negating the benefit.

### Total: 13 approaches tested
The 4-mag constant LUT at 15.1 tok/s (0.69× q8_0) is the definitive dequant-level ceiling on Apple8 hardware. Every alternative — fewer constant addresses, zero constant addresses, branchless ALU, cross-lane shuffle, inline FA blocks — is equal or worse.

## Extended Experiments (approaches 14, 2026-03-27)

| # | Approach | M2 8K tok/s | vs 4-mag | Key finding |
|---|----------|-------------|---------|-------------|
| 14 | Fused block dot (per-centroid Q accum) | 8.1 | -46% | 64 float comparisons per block — worst result of all |

### Fused block dot details
Flipped computation: instead of per-element centroid lookup, iterate over 4 centroid values and accumulate matching Q elements. Uses `float(mi == c)` as a branchless mask. But 4 centroids × 4 elements × 4 comparisons = 64 float comparison operations per dequant call. Each comparison likely compiles to a branch on Apple8, making this the most expensive approach tested.

### Final tally: 14 approaches tested

| Rank | Approach | M2 8K | vs Ceiling | Category |
|------|----------|-------|-----------|----------|
| 1 | 4-mag LUT | 15.1 | 62% | 4 constant reads |
| 2 | simd_shuffle | 14.7 | 60% | Cross-lane transfer |
| 3 | Batched extract (8-LUT) | 13.7 | 56% | 8 constant reads |
| 4 | Inline FA block | 13.5 | 55% | Inlined dequant |
| 5 | Deferred norm | 12.9 | 53% | Loses ILP |
| 6 | 2-pair half2 | 12.0 | 49% | 2 reads + ternary |
| 7 | Select chain | 11.9 | 49% | Pure branches |
| 8 | Bit-arithmetic | 11.6 | 47% | Pure ALU |
| 9 | FMA branchless | 11.4 | 47% | fma() chain |
| 10 | Main (8-LUT) | 10.95 | 45% | Baseline |
| 11 | Named-reg ternary | 10.3 | 42% | Registers + branches |
| 12 | Non-vec FA | 10.2 | 42% | Wrong kernel |
| 13 | Fused block dot | 8.1 | 33% | 64 comparisons |
| — | Ceiling (no dequant) | 24.5 | 100% | Zero overhead |

## Extended Experiments (approach 15, 2026-03-28)

| # | Approach | M2 8K tok/s | vs 4-mag | Key finding |
|---|----------|-------------|---------|-------------|
| 15 | SMEM pre-dequant (threadgroup memory tile) | 10.17 | -33% | Threadgroup store/load overhead > constant cache savings |

### SMEM pre-dequant details
Pre-dequantize entire K/V tiles (C=32 × DK=128 = 8KB half) into threadgroup memory before dot products. All 32 threads cooperatively dequant C cache positions, barrier, then compute from SMEM.

**Why it failed**: The FA vec kernel distributes work so each thread operates on unique data (DK4/NL=1 dequant per thread per cache position). Each dequanted value is used exactly once by its producer thread. Caching in SMEM adds 64 threadgroup memory ops (32 stores + 32 loads) per thread per outer iteration + barrier, for zero reduction in constant LUT reads. Additionally, separating dequant from compute destroys the ILP that makes the 4-mag LUT work (constant reads overlapped with ALU). At short context, the overhead is negligible (+1.8%), but at 8K it's catastrophic (-51.5%).

**Lesson**: SMEM only helps when data is shared between threads. Don't separate what the hardware pipelines together.

### Updated tally: 16 approaches tested

| Rank | Approach | M2 8K | vs Ceiling | Category |
|------|----------|-------|-----------|----------|
| 1 | 4-mag LUT | 15.1 | 62% | 4 constant reads |
| 2 | simd_shuffle | 14.7 | 60% | Cross-lane transfer |
| 3 | Batched extract (8-LUT) | 13.7 | 56% | 8 constant reads |
| 4 | Inline FA block | 13.5 | 55% | Inlined dequant |
| 5 | Deferred norm | 12.9 | 53% | Loses ILP |
| 6 | 2-pair half2 | 12.0 | 49% | 2 reads + ternary |
| 7 | Select chain | 11.9 | 49% | Pure branches |
| 8 | Bit-arithmetic | 11.6 | 47% | Pure ALU |
| 9 | FMA branchless | 11.4 | 47% | fma() chain |
| 10 | Main (8-LUT) | 10.95 | 45% | Baseline |
| 11 | Named-reg ternary | 10.3 | 42% | Registers + branches |
| 12 | Non-vec FA | 10.2 | 42% | Wrong kernel |
| 13 | SMEM pre-dequant | 10.17 | 41% | Threadgroup cache (ILP loss) |
| 14 | Q·centroid precompute | 10.10 | 41% | select() register LUT |
| 15 | Fused block dot | 8.1 | 33% | 64 comparisons |
| — | Ceiling (no dequant) | 24.5 | 100% | Zero overhead |

### Critical ILP insight (2026-03-28)

The 2x slowdown of SMEM pre-dequant revealed WHY the 4-mag LUT works: **interleaved constant reads and ALU provide instruction-level parallelism.** The GPU overlaps the next constant read while the current dot product + norm multiply executes. Any approach that separates memory reads from compute phases loses this overlap and runs at 2x the latency.

This explains the ranking pattern: approaches that maintain per-element `read → ALU` interleaving (4-mag, simd_shuffle, batched 8-LUT) outperform approaches that batch reads (SMEM, deferred norm, QC precompute) or eliminate reads but add more ALU (select chain, FMA, bit-arithmetic).

**The remaining 38% gap cannot be closed by rearranging the same reads and ALU.** The only path forward is reducing the total constant read count per element below 4, which requires changing the block format or computation structure.

## Extended Experiments (approaches 16h + 19, 2026-03-28 evening)

Clean benchmarks (no DINOv2 contention). Note: absolute tok/s numbers differ from earlier runs due to DINOv2 GPU contention during those measurements. Relative rankings are consistent.

### Qwen2.5-7B-Instruct-Q4_K_M (8K context, p=8192 tg128)

| # | Approach | Env Var | turbo3 t/s | turbo4 t/s | q8_0 t/s | Key finding |
|---|----------|---------|-----------|-----------|---------|-------------|
| — | Baseline (4-mag LUT) | — | 25.88 | 27.59 | 33.69 | turbo4 > turbo3 by 6.6% |
| 16h | Half register LUT (`half cn[8]`) | TURBO_HALF_REG_LUT=1 | 25.55 (-1.3%) | 27.26 (-1.2%) | — | Still spills on Apple8. Noise. |
| 19 | Threadgroup centroid cache | TURBO_TG_CENTROID=1 | 25.96 (+0.3%) | 27.42 (-0.6%) | — | Flat. Threadgroup read ≈ constant read. |

### Qwen2.5-1.5B-Instruct-Q4_K_M (8K context, p=8192 tg128)

| # | Approach | Env Var | turbo3 t/s | turbo4 t/s | q8_0 t/s | Key finding |
|---|----------|---------|-----------|-----------|---------|-------------|
| — | Baseline (4-mag LUT) | — | 51.84 | 58.74 | 113.02 | Gap worse on 1.5B (attention dominates) |
| 16h | Half register LUT | TURBO_HALF_REG_LUT=1 | 50.67 (-2.3%) | 57.76 (-1.7%) | — | Consistent regression across models |
| 19 | Threadgroup centroid cache | TURBO_TG_CENTROID=1 | 52.19 (+0.7%) | 57.98 (-1.3%) | — | Flat |

### Prompt processing (pp8192, for reference)

| Model | turbo3 | turbo4 | q8_0 |
|-------|--------|--------|------|
| 7B | 241.26 | 240.00 | 254.74 |
| 1.5B | 816.13 | 807.03 | 896.42 |

### Key takeaways from approaches 16h + 19

1. **Both approaches are noise on M2 Pro.** Half Reg LUT is consistently -1 to -2% (regression). TG Centroid is flat (within variance).

2. **turbo4 > turbo3 on M2 Pro** by 6-13% on decode. Holds across both model sizes. 16 centroids (turbo4) is faster than 8 (turbo3) — possibly better ILP with 4-bit indices.

3. **The gap to q8_0 grows with smaller models**: 30% on 7B, 118% on 1.5B. Smaller models make attention a larger fraction of decode, making the dequant overhead more visible.

4. **The centroid LUT is NOT the sole bottleneck.** Moving centroids to half-precision registers or threadgroup memory doesn't help. The bottleneck is broader — likely the full WHT transform + extraction pipeline, not just the final centroid lookup.

### Revised theory (2026-03-28)

The original theory was "constant memory LUT divergence is the bottleneck." After testing 4 approaches targeting the centroid LUT specifically (#15 SMEM pre-dequant, #16q Q·centroid precompute, #16h half reg LUT, #19 TG centroid), NONE improved decode speed. The 4-mag LUT improvement from earlier was real (+38%), but it was optimizing the full dequant pipeline (fewer reads + XOR sign trick), not just the centroid lookup.

The remaining gap is structural: turbo3/4 dequant requires WHT extraction + centroid lookup + norm multiply per element, while q8_0 is a simple `int8 * scale`. No rearrangement of the same operations will close this gap.

### Updated tally: 18 approaches tested (16h and 19 added)

| Rank | Approach | M2 8K | vs Ceiling | Category |
|------|----------|-------|-----------|----------|
| 1 | 4-mag LUT | 15.1 | 62% | 4 constant reads |
| 2 | simd_shuffle | 14.7 | 60% | Cross-lane transfer |
| 3 | Batched extract (8-LUT) | 13.7 | 56% | 8 constant reads |
| 4 | Inline FA block | 13.5 | 55% | Inlined dequant |
| 5 | Deferred norm | 12.9 | 53% | Loses ILP |
| 6 | 2-pair half2 | 12.0 | 49% | 2 reads + ternary |
| 7 | Select chain | 11.9 | 49% | Pure branches |
| 8 | Bit-arithmetic | 11.6 | 47% | Pure ALU |
| 9 | FMA branchless | 11.4 | 47% | fma() chain |
| 10 | Main (8-LUT) | 10.95 | 45% | Baseline |
| 11 | Named-reg ternary | 10.3 | 42% | Registers + branches |
| 12 | Non-vec FA | 10.2 | 42% | Wrong kernel |
| 13 | SMEM pre-dequant | 10.17 | 41% | Threadgroup cache (ILP loss) |
| 14 | Q·centroid precompute | 10.10 | 41% | select() register LUT |
| 15 | Fused block dot | 8.1 | 33% | 64 comparisons |
| 16h | Half register LUT | ~noise | ~62% | half cn[8] — still spills |
| 19 | TG centroid cache | ~noise | ~62% | Threadgroup centroid table |
| — | Ceiling (no dequant) | 24.5 | 100% | Zero overhead |

## Extended Experiments (approaches 20 + 22, 2026-03-28 night)

### Qwen2.5-1.5B-Instruct-Q4_K_M (8K context, p=8192 tg128)

| # | Approach | Env Var | turbo3 t/s | turbo4 t/s | Key finding |
|---|----------|---------|-----------|-----------|-------------|
| — | Baseline (4-mag LUT) | — | 51.80 | 58.60 | — |
| 20 | 2-bit direct encode (pure ALU, no LUT) | TURBO_DIRECT_ENCODE=1 | 52.16 (+0.7%) | 57.78 (-1.4%) | **Same speed with zero constant reads** |
| 22 | Async prefetch (batch device reads) | TURBO_ASYNC_PREFETCH=1 | 50.73 (-2.1%) | 57.02 (-2.7%) | GPU already interleaves reads optimally |

### Definitive proof: constant memory LUT is FREE on M2 Pro

Approach #20 replaced the entire centroid LUT with pure ALU (`norm * (0.25 + 0.5*idx)`) — zero constant memory reads. The result is identical speed. This means the 4 constant reads per element are hitting L1 cache and costing essentially nothing.

The bottleneck is **device memory bandwidth**: streaming 14 bytes per 32 elements (turbo3) from DRAM, strided across cache positions. q8_0 streams 34 bytes per 32 elements but with a simpler access pattern and trivial dequant (`int8 * scale`).

### Why async prefetch didn't help (#22)

Staging device memory reads (norm, qs, signs) into registers before constant reads forced a specific execution order. The GPU's out-of-order execution already interleaves these optimally. Forcing order adds register pressure without hiding latency.

### Updated tally: 20 approaches tested

| Rank | Approach | M2 8K | vs Ceiling | Category |
|------|----------|-------|-----------|----------|
| 1 | 4-mag LUT | 15.1 | 62% | 4 constant reads |
| 2 | simd_shuffle | 14.7 | 60% | Cross-lane transfer |
| 3 | Batched extract (8-LUT) | 13.7 | 56% | 8 constant reads |
| 4 | Inline FA block | 13.5 | 55% | Inlined dequant |
| 5 | Deferred norm | 12.9 | 53% | Loses ILP |
| 6 | 2-pair half2 | 12.0 | 49% | 2 reads + ternary |
| 7 | Select chain | 11.9 | 49% | Pure branches |
| 8 | Bit-arithmetic | 11.6 | 47% | Pure ALU |
| 9 | FMA branchless | 11.4 | 47% | fma() chain |
| 10 | Main (8-LUT) | 10.95 | 45% | Baseline |
| 11 | Named-reg ternary | 10.3 | 42% | Registers + branches |
| 12 | Non-vec FA | 10.2 | 42% | Wrong kernel |
| 13 | SMEM pre-dequant | 10.17 | 41% | Threadgroup cache (ILP loss) |
| 14 | Q·centroid precompute | 10.10 | 41% | select() register LUT |
| 15 | Fused block dot | 8.1 | 33% | 64 comparisons |
| 16h | Half register LUT | ~noise | ~62% | half cn[8] — still spills |
| 19 | TG centroid cache | ~noise | ~62% | Threadgroup centroid table |
| 20 | Direct encode (no LUT) | ~noise | ~62% | Pure ALU, zero constant reads |
| 22 | Async prefetch | -2% | ~61% | Forced read ordering |
| — | Ceiling (no dequant) | 24.5 | 100% | Zero overhead |

## SMEM Pre-Dequant Re-Run (2026-03-28 night, corrected)

The original -51% SMEM result was measured during DINOv2 GPU contention. Clean re-run shows it's actually a small positive:

### Qwen2.5-1.5B Q4_K_M (8K context)

| Config | SMEM t/s | Baseline t/s | Delta |
|--------|---------|-------------|-------|
| turbo3 short | 53.96 | 51.84 | **+4.1%** |
| turbo3 8K | 54.13 | 51.84 | **+4.4%** |
| turbo4 short | 60.20 | 58.74 | **+2.5%** |
| turbo4 8K | 60.37 | 58.74 | **+2.8%** |

### Qwen2.5-7B Q4_K_M (8K context)

| Config | SMEM t/s | Baseline t/s | Delta |
|--------|---------|-------------|-------|
| turbo3 8K | 26.17 | 25.88 | +1.1% |
| turbo4 8K | 26.31 | 27.59 | **-4.6%** |

### Analysis: SMEM helps small models, hurts large turbo4

The pattern is clear: SMEM pre-dequant trades constant memory pressure for threadgroup memory + barriers. This helps when:
- **Model is small** (attention is larger fraction of decode → dequant matters more)
- **KV format is turbo3** (8 centroids = more constant reads per element to avoid)

It hurts when:
- **Model is large** (register pressure from SMEM allocation → spills)
- **KV format is turbo4 on 7B** (16 centroids = larger pre-dequant tile → more SMEM overhead)

This suggests a **model-size-adaptive dispatch**: use SMEM pre-dequant for small models (≤3B), 4-mag LUT for large models (≥7B). The crossover is somewhere in between.

## SMEM Pre-Dequant on M5 Max — Small Regression (2026-03-28 night, corrected)

Tested SMEM on M5 Max (Apple10) with Qwen3.5-35B-A3B MoE. Initial results showed -77% but that was an env var bug (TURBO_SMEM_DEQUANT not set at runtime). Corrected results with env var properly set:

### Qwen3.5-35B-A3B Q8_0, M5 Max, decode (tg128 t/s)

| Context | turbo3 baseline | turbo3 SMEM | turbo4 baseline | turbo4 SMEM |
|---------|----------------|-------------|-----------------|-------------|
| short | 78.47 | 80.32 (+2.4%) | 80.40 | 77.64 (-3.4%) |
| 8K | 78.90 | 76.99 (-2.4%) | 79.84 | 78.78 (-1.3%) |
| 16K | 78.64 | 77.98 (-0.8%) | — | — |
| 32K | 78.17 | 76.18 (-2.5%) | — | — |

Note: high variance on some runs (±2.57 t/s). The one positive (turbo3 short +2.4%) is noise.

### Why SMEM doesn't help on M5

- **M5's constant cache (Apple10) is already fast** — handles 8-way divergent reads with minimal penalty
- SMEM adds threadgroup store/barrier/load overhead for data the constant cache was already serving efficiently
- Unlike M2 where SMEM helped small models +4%, M5 sees no benefit at any context depth

### Verdict: SMEM is dead

Do NOT ship SMEM pre-dequant. Small regression on M5, marginal win on M2 small models only. Not worth the complexity.

## Final Conclusion: M2 Decode Ceiling (2026-03-28)

**20 approaches tested + 1 corrected re-run. The 4-mag LUT is the M2 ceiling for large models. SMEM pre-dequant adds ~2-4% for small models.**

The bottleneck is NOT:
- Constant memory LUT divergence (proven by #20: zero LUT = same speed)
- Centroid lookup specifically (proven by #16h, #19: different storage = same speed)
- Read ordering (proven by #22: forced ordering = worse)
- Branch overhead (proven by #6-9: branchless = worse)

The bottleneck IS:
- **Device memory bandwidth** for streaming quantized KV blocks from DRAM
- **Dequant ALU complexity** (WHT extraction + centroid + norm vs q8_0's simple `int8 * scale`)
- These two costs are inherent to the turbo format and cannot be optimized away without changing the format itself
- For small models where attention dominates, SMEM can squeeze out ~4% by trading constant reads for threadgroup reads

### Remaining dequant-level approaches (low priority, likely moot)

| # | Approach | Category | Expected Impact | Status |
|---|----------|----------|----------------|--------|
| 17 | Device-memory centroid×norm | Block format change | Moot — centroid reads are free | NOT TESTED |
| 18 | Byte-indexed 256-entry LUT | LUT restructure | Moot — same reason | NOT TESTED |
| 21 | Hybrid 4-mag + simd_shuffle | Combined | Low — simd_shuffle was only -2.6% standalone | NOT TESTED |

---

## Phase 2: Kernel/Format/Pipeline-Level Optimization (2026-03-28 night)

Phase 1 (approaches 1-22) exhausted dequant-level optimization: rearranging reads, ALU, barriers, and storage within the existing FA vec kernel template. The 4-mag LUT is the dequant ceiling.

Phase 2 attacks at a higher level: different kernel shapes, different data layouts, and runtime dispatch. These are more invasive but target the actual structural gap.

### Experiment #23: Turbo-Only Non-Vec Decode Kernel

**Category:** Kernel path

**Hypothesis:** The current FA vec kernel (`kernel_flash_attn_ext_vec`) is designed for general-purpose quantized attention. A turbo-specific decode kernel built from scratch for single-token long-context generation could avoid overhead from the generic template.

**Why this might work:**
- The vec kernel's template machinery (generic dequant function pointers, parameterized NL/NSG/NE) adds register pressure and instruction cache footprint. A hand-specialized turbo3 dk128/dv128 kernel can hardcode everything.
- Earlier test of non-vec FA (approach #10, 10.2 t/s) used the EXISTING generic non-vec kernel, which is optimized for multi-token prefill, not single-token decode. A purpose-built decode kernel is a different thing entirely.
- Older Apple GPUs (Apple7/8) reward brutally specific kernels over generic templates.

**Implementation plan:**
1. Create a new kernel function `kernel_flash_attn_ext_turbo3_decode_dk128` in `ggml-metal.metal`
2. Hardcode: dk=128, dv=128, single-token (ne01=1), turbo3 block format
3. Inline the 4-mag dequant directly (no function pointer indirection)
4. Optimize the Q*K^T loop for the exact turbo3 block layout: read 14 bytes (2 bytes norm + 8 bytes qs + 4 bytes signs), extract inline, dot product inline
5. Optimize the V accumulation separately (can use different strategy than K)
6. Wire through `ggml-metal-ops.cpp` dispatch: use this kernel when `type_k == TURBO3_0 && ne01 == 1 && dk == 128`
7. Gate behind `TURBO_SPECIALIZED_DECODE=1` env var

**Key code touchpoints:**
- `ggml/src/ggml-metal/ggml-metal.metal` — new kernel function + instantiation
- `ggml/src/ggml-metal/ggml-metal-ops.cpp` — dispatch logic to select this kernel
- `ggml/src/ggml-metal/ggml-metal-device.m` — env var wiring

**Success criteria:**
- Any improvement over 25.88 t/s (turbo3 7B 8K baseline) is a win
- 30+ t/s would close the gap to q8_0 (33.69) significantly
- Regression at short context is acceptable if long context improves

**Risk:** High effort, uncertain payoff. The template overhead might not be the bottleneck.

**Result: NEGATIVE.** Metal shader compiler already fully specializes templates — function pointers resolved, dimensions constant-folded, dead code eliminated. No runtime overhead to remove.

| Model | Baseline t/s | Specialized t/s | Delta |
|-------|-------------|----------------|-------|
| 1.5B short | 53.34 | 52.77 | -1.1% |
| 1.5B 8K | 51.73 | 51.44 | -0.6% |
| 7B 8K | 25.89 | 25.83 | -0.2% (noise) |

The slight regression is likely from increased instruction footprint — duplicated inline dequant in K and V loops vs the compiler sharing code between template-generated paths. **The generic template IS the optimized kernel.**

---

### Experiment #24: Apple7/8-Specific Alternate Block Format

**Category:** Data layout

**Hypothesis:** The turbo3 block format was designed for the algorithm, not for Apple8 GPU memory access patterns. A second on-device KV format optimized for how the M2 FA vec kernel actually consumes data could reduce memory stalls.

**Why this might work:**
- Current turbo3 block layout: `[norm(2B)] [qs(8B)] [signs(4B)]` = 14 bytes per 32 elements. The kernel reads these in 3 separate device memory accesses at different offsets within the block.
- If we interleave the data by 4-element consumption order (matching the vec kernel's `DK4` stride), each fetch brings exactly what the next compute step needs — no wasted bandwidth, better prefetch prediction.
- The profiling data shows "reading bytes" costs ~10% of ceiling on both M2 and M5. Reducing this by even half could be meaningful when combined with other improvements.

**Implementation plan:**
1. Design an alternate block layout `block_turbo3_m2` where data is arranged by the vec kernel's consumption order:
   - For each 4-element group: `[norm_chunk] [qs_chunk] [sign_bits]` contiguous
   - Or: per-8-element granule matching SIMD consumption width
2. Add a SET_ROWS variant that packs into the new layout during KV cache write
3. Add a `dequantize_turbo3_m2_t4` that reads the new layout
4. Wire as a new kernel instantiation
5. Gate behind `TURBO_M2_FORMAT=1` env var

**Key code touchpoints:**
- `ggml/src/ggml-common.h` — new block struct definition
- `ggml/src/ggml-metal/ggml-metal.metal` — new dequant function + kernel instantiation
- `ggml/src/ggml-metal/ggml-metal-ops.cpp` — SET_ROWS + dispatch for new type
- `ggml/src/ggml-metal/ggml-metal-device.m` — env var

**Success criteria:**
- Measurable improvement in the "read bytes" profiling mode (mode 4 → mode 3 gap)
- Any decode speed improvement over 4-mag baseline
- Must not regress PPL (same data, different packing)

**Risk:** Medium effort, medium payoff. The 10% "read bytes" cost is real but may already be well-hidden by the dequant ALU.

---

### Experiment #25: Dual Runtime Dispatch by Chip Family + Context Depth

**Category:** Pipeline

**Hypothesis:** No single kernel configuration is optimal across all context depths. A runtime dispatch that selects different strategies based on context depth could win overall.

**Why this might work:**
- Already proven: 4-mag helps at 16K on M5 (+2.4%) but hurts at 32K (-7.3%). Crossover at ~20K.
- turbo3 no-dequant is 12% faster than q8_0 at short context (bandwidth advantage) but 38% slower at 8K (dequant overhead dominates).
- The vec kernel has `NL` parameter controlling simdgroup work distribution. Different NL values may be optimal at different context depths.

**Implementation plan:**
1. Compile both 4-mag and 8-LUT FA kernel instantiations for turbo3/turbo4 on pre-M5
2. In `ggml-metal-ops.cpp`, add context-depth dispatch:
   - Pre-M5 + context < threshold: one kernel config
   - Pre-M5 + context ≥ threshold: different kernel config
3. Profile to find the optimal crossover point
4. Also test different `NL` values (currently NL=1 for dk128) at different context depths
5. Gate behind `TURBO_CONTEXT_DISPATCH=1` env var

**Key code touchpoints:**
- `ggml/src/ggml-metal/ggml-metal-ops.cpp` — dispatch logic (primary)
- `ggml/src/ggml-metal/ggml-metal.metal` — may need additional kernel instantiations with different NL
- `ggml/src/ggml-metal/ggml-metal-device.m` — env var

**Success criteria:**
- Better decode across the full context range (short through 32K) than any single configuration
- Minimal code complexity (dispatch is a few lines in ops.cpp)

**Risk:** Low effort, medium payoff. The gains from 4-mag vs 8-LUT switching were small on M5 (+2.4% at one depth). On M2 where the gap is larger, the payoff might be bigger.

**Result: NEGATIVE on 1.5B.** turbo3/turbo4 decode is completely flat across all context depths (~52/~59 t/s). q8_0 gets faster at long context (48 → 109 t/s) because attention becomes a larger fraction of decode. No crossover point to dispatch on.

| Config | short | 2K | 4K | 8K | 16K |
|--------|-------|-----|-----|------|------|
| f16 (ceiling) | 119.60 | — | — | 116.22 | 116.14 |
| q8_0 | ~48 | 64 | ~82 | 107 | 109 |
| turbo4 | 59.29 | 59.02 | 59.00 | 60.65 | 59.03 |
| turbo3 (4-mag) | 52.20 | 51.94 | 52.36 | 51.79 | 51.86 |
| turbo3 (zero dequant) | 55.68 | 55.87 | 55.83 | 55.42 | 55.69 |

Additional findings:
- Zero-dequant gap is only 7% on 1.5B — FFN matmuls dominate, not dequant
- Asymmetric K/V (mixed types) not supported — no mixed-type kernel templates
- turbo2 not registered in this build — needs update
- Needs retesting on larger model (7B+) where attention is a bigger fraction of decode

---

### Priority order for Phase 2

1. **#23 Turbo-only non-vec decode kernel** — highest potential, directly attacks the structural gap. Start here.
2. **#25 Dual runtime dispatch** — low effort, can run in parallel with #23 as a quick profiling exercise.
3. **#24 Alternate block format** — medium effort, save for after #23 results inform whether the bottleneck is compute or memory layout.

### Additional ideas from full brainstorm (backlog, not prioritized)

These are logged for future reference. Each could become a numbered experiment if the top 3 don't pan out.

**Kernel-path:**
- Lane-specialized dequant microkernels (hardcode per dk/dv pair)
- Split K/V strategies (different kernel for score vs accumulation)
- Metadata-first K path (stage norm+bits, defer centroid materialization)
- Two-token decode kernel (microbatch=2 for better utilization)
- Persistent-thread decode kernel (keep threadgroups resident across tiles)

**Computation:**
- Norm hoisting at higher granularity (fused partial accumulation)
- Affine centroid approximation (approximate arithmetic + correction term)
- Top-2 magnitude encode (cheap primary path + rare correction)
- Softmax-side pruning before full V work (conservative early exit)
- Mixed-precision accumulation schedule (audit half/float widen points)

**Instrumentation:**
- Per-section timing inside Metal kernel (finer than current profiling modes)
- Dimension-specific perf matrix (128 vs 192 vs 256 on Apple8)
- Instruction-cache pressure audit (code-size-reduced kernel variants)

## M5 Max Long-Context Discovery (2026-03-27)

### The constant cache bottleneck hits M5 Max too at long context

| Depth | Ceiling | Reads only | Full dequant | q8_0 | LUT cost | Ceiling vs q8_0 |
|-------|---------|-----------|-------------|------|----------|----------------|
| 8K | 78.9 | 75.2 | 59.2 | 78.8 | 20% | 1.00x |
| 16K | 75.9 | 74.7 | 58.7 | 72.0 | 21% | 1.05x |
| 32K | 78.3 | 71.9 | 47.1 | 61.0 | **34%** | **1.28x** |

**turbo3 with zero dequant is 28% FASTER than q8_0 at 32K on M5 Max.** The compressed cache bandwidth advantage grows with context. The LUT cost explodes from 20% to 34% as context grows.

### 4-mag vs 8-LUT on M5 Max across context depths

| Depth | q8_0 | 8-LUT | 4-mag | 4-mag vs 8-LUT |
|-------|------|-------|-------|----------------|
| short | 85.0 | 76.7 | 76.2 | -0.7% |
| 16K | 72.0 | 58.9 | **60.3** | **+2.4%** |
| 32K | 61.0 | 47.6 | 44.1 | -7.3% |

4-mag helps at 16K (+2.4%) but hurts at 32K (-7.3%) on M5. Crossover around 20K.

### Context-adaptive dispatch (planned)
Compile both 4-mag and 8-LUT FA kernel instantiations. At dispatch time, select based on KV cache size:
- Pre-M5 (no tensor API): always 4-mag
- M5+ with context < 8K: 8-LUT (minimal cache pressure)
- M5+ with 8K-20K context: 4-mag (moderate pressure, 4-mag helps)
- M5+ with context > 20K: 8-LUT (fully thrashed, ALU overhead dominates)

### The real prize
If we can reduce dequant cost from 34% to ~10% at 32K, turbo3 decode would be **FASTER than q8_0** at long context. The bandwidth advantage (28% at 32K) far exceeds the dequant overhead — we just need to close the gap. This flips the narrative from "turbo3 is slower but smaller" to "turbo3 is faster AND smaller."