# TurboQuant Speed Investigation

## Problem Statement

TurboQuant generates at 2.4 tok/s vs 85.0 tok/s for q8_0 on Qwen 3.5 35B-A3B MoE (M5 Max).
That's a 35× regression. Compression target (4.9×) is met, but speed makes it impractical.

## Root Cause Analysis

### Why it's slow

The flash attention kernel calls `dequantize_turbo3_0(block_ptr, il, &reg)` once per chunk:
- `type4x4` path: 16 elements per call, 128/16 = **8 calls per block**
- `type4` path (flash_attn_ext_vec): 4 elements per call, 128/4 = **32 calls per block**

Each call does the FULL 128-element dequantize:
1. Unpack 128 indices from packed bits → centroid lookup
2. Inverse WHT rotation (7 butterfly passes × 128 = 896 ops)
3. Unpack 128 QJL signs
4. Inverse WHT rotation on signs (another 896 ops)
5. Combine and scale

Total per call: ~2000 ops + 256 bytes stack allocation
Total per block: 8-32 × 2000 = **16,000-64,000 ops** (should be 2000)

### Comparison with q8_0

q8_0 dequantize: `x[i] = qs[i] * d` — 1 multiply per element, no stack, no rotation.
Per block (32 elements): 32 ops. Per 128 elements: 128 ops.

TurboQuant is doing 64,000 ops where q8_0 does 128. That's **500× more compute per block**.

Even with WHT (18× less than dense matvec), we're at 64,000/18 ≈ 3,500 ops vs 128.
That's still **27× more compute**, close to the measured 35× regression.

The extra ~8× gap is from stack allocation overhead (128-float arrays created/destroyed
32× per block) and memory bandwidth (reading the constant sign arrays 32× per block).

---

## Investigation Plan

### Approach A: Modify flash attention to dequantize once per block
- [ ] **A1**: Study the flash attention kernel template structure
  - Find where `deq_k` and `deq_v` are called
  - Understand the iteration pattern: which blocks, which chunks
  - Identify where to insert a "pre-dequantize" step
- [ ] **A2**: Add threadgroup memory buffer for pre-dequantized blocks
  - Allocate `threadgroup float turbo_deq_buf[128]` in flash attention kernel
  - Before the per-chunk loop, call `turbo3_dequantize_full_block()` once into this buffer
  - Replace per-chunk dequantize calls with reads from the buffer
- [ ] **A3**: Test with turbo-specific flash attention kernel instantiation
  - The generic template uses function pointers for dequantize
  - May need a specialized kernel that bypasses the per-chunk pattern
- [ ] **A4**: Benchmark after each change

### Approach B: Custom flash attention kernel for turbo types
- [ ] **B1**: Fork `kernel_flash_attn_ext_vec` into `kernel_flash_attn_ext_vec_turbo`
  - Remove the generic dequantize function pointer
  - Inline turbo-specific dequantize at the block level
  - Use threadgroup memory for the dequantized block
- [ ] **B2**: Replace per-chunk reads with direct buffer indexing
  - Instead of `deq_k(pk4x4 + block_idx, chunk_idx, tmp)`, do `tmp = buf[chunk_idx * 16 : (chunk_idx+1) * 16]`
- [ ] **B3**: Register the custom kernel in pipeline lookup
- [ ] **B4**: Benchmark

### Approach C: Restructure dequantize to amortize across chunks
- [ ] **C1**: Change the dequantize API to accept a pre-allocated buffer
  - `dequantize_turbo3_0(block_ptr, il, &reg, thread float * shared_buf)`
  - On first call (il==0), fill the buffer; on subsequent calls, read from it
  - Problem: can't change the function signature without changing all callers
- [ ] **C2**: Use the `nl_k` parameter differently
  - Currently nl_k=32 for turbo3 (128/4). What if we set nl_k=1 and return all 128 at once?
  - Would need the caller to handle 128-element chunks
  - Probably breaks the template assumptions

### Approach D: Reduce per-call overhead without architecture changes
- [ ] **D1**: Precompute the WHT butterfly as a lookup table
  - Instead of 7 butterfly passes, use a precomputed 128-element permutation
  - Trade memory for compute: 128 × 4 bytes = 512 bytes constant
  - Won't help much since WHT is already fast
- [ ] **D2**: Pack the dequantize tighter — reduce stack allocations
  - Merge the centroid lookup + WHT into a single pass
  - Avoid allocating separate `recon[128]` and `signs_f[128]` — interleave
- [ ] **D3**: Use half precision for intermediate calculations
  - `half` arithmetic is 2× faster on Apple Silicon
  - May reduce quality slightly but worth testing

---

## Expected Outcomes

| Approach | Expected Speedup | Effort | Risk |
|----------|-----------------|--------|------|
| A (modify kernel) | 8-32× (eliminate redundant calls) | Medium | Medium — need to understand kernel internals |
| B (custom kernel) | 8-32× + optimal memory access | High | Low — clean separation |
| C (restructure API) | 8-32× | Low | High — may break template |
| D (reduce overhead) | 2-3× | Low | Low |

**Recommended order**: D first (quick wins), then A or B (the real fix).

Target: D → 5-8 tok/s, then A/B → 20-40 tok/s.

---

## Progress Log

### 2026-03-25: Initial investigation
- Dense matvec: 2.4 tok/s (35× slower than q8_0)
- WHT rotation: 2.4 tok/s (same — bottleneck is redundant calls, not per-call compute)
- Root cause confirmed: dequantize called 8-32× per block by flash attention
- Codex + roast reviewed WHT implementation: correct, no bugs

### Next: Start with Approach D (reduce per-call overhead)

### 2026-03-25: simd_broadcast attempt
- Added simd_broadcast fast path for K and V dequant (nl_k==32 && DK==128)
- Thread 0 dequantizes, broadcasts 128 floats via simd_broadcast loop
- **Result: still 2.4 tok/s** — the 128-iteration simd_broadcast loop per cc iteration
  is itself expensive. 32 cc iterations × 128 broadcasts = 4096 simd_broadcast calls per block.
- Codex review caught: DK>128 OOB bug (fixed), turbo4 using turbo3 dequant (fixed),
  uninitialized turbo_buf on non-lane-0 (fixed with zero-init)
- **Conclusion**: simd_broadcast is wrong tool. Need threadgroup memory instead.

### Next: try threadgroup memory approach
- Allocate extra threadgroup memory in FATTN_SMEM
- One thread writes 128 floats to threadgroup, barrier, all threads read
- This reduces to 1 dequant + 1 barrier per cc iteration instead of 128 broadcasts

### 2026-03-25: threadgroup memory attempt
- Replaced simd_broadcast with threadgroup memory + simdgroup_barrier
- Thread 0 dequantizes into threadgroup, barrier, all threads read
- **Result: still 2.4 tok/s**
- Eliminating 31/32 redundant dequant calls had NO effect on speed
- This means the dequant cost itself (even 1× per block) is NOT the bottleneck
- Or the bottleneck is elsewhere entirely (SET_ROWS quantize? block size overhead?)

### Hypothesis: block size 128 vs 32 causes structural overhead
- q8_0 block size = 32, turbo block size = 128
- The flash attention kernel processes DK4/NL elements per thread per cc iteration
- For q8_0: DK4/NL = 32/8 = 4 iterations (inner ii loop runs 4×)
- For turbo: DK4/NL = 32/32 = 1 iteration (inner ii loop runs 1×)
- But NL = 32 for turbo vs NL = 4 for q8_0 (32/8=4, C=32, NE=1→NW/NE=32 for both)
- Actually NL = NW/NE = 32/1 = 32 for both... so DK4/NL should be the same?
- Wait: for q8_0, nl_k=8 (32 elements / 4 per t4 = 8 chunks). DK4 = 128/4 = 32. DK4/NL = 32/32 = 1.
- So BOTH q8_0 and turbo have DK4/NL = 1 iteration in the inner loop.
- The only difference is the dequant function itself.

### Next: profile whether the bottleneck is in dequant or elsewhere
- Test with a no-op dequant (return zeros) to measure the kernel overhead
- If still slow → bottleneck is NOT dequant, it's structural

### 2026-03-25: NO-OP DEQUANT TEST — CRITICAL FINDING
- Made dequantize_turbo3_0 and dequantize_turbo3_0_t4 return all zeros (no-op)
- **Result: STILL 2.4 tok/s**
- **The dequant function is NOT the bottleneck.**
- Even with zero-cost dequant, turbo3 is 35× slower than q8_0.
- The bottleneck is ELSEWHERE:
  1. SET_ROWS (quantize) path — 128×128 WHT per block during KV cache write
  2. Kernel dispatch overhead for 128-element blocks
  3. Memory layout/alignment issues with the larger block structure
  4. The quantize kernel (kernel_set_rows_turbo) doing full WHT per KV write

### New investigation target: SET_ROWS (quantize path)
- The quantize_turbo3_0 function in the Metal shader does:
  1. Normalize (128 ops)
  2. Copy to working buffer (128 ops)  
  3. Forward WHT rotation (896 ops)
  4. Centroid lookup + pack (128 ops)
  5. Inverse WHT rotation (896 ops)
  6. Residual (128 ops)
  7. Forward QJL WHT rotation (896 ops)
  8. Sign pack (128 ops)
  Total: ~3200 ops per block in the QUANTIZE path
- This runs for every token at every layer at every head
- For Qwen 3.5 MoE: 28 layers × 8 heads = 224 quantize calls per generated token
- Each quantize does 3200 ops = 716,800 ops per token just for KV cache write

### Next: profile SET_ROWS path / try no-op quantize test

### 2026-03-25: THE BUG — #include caused CPU fallback!!! 🎉
- `#include "turbo-wht.h"` in ggml-metal.metal causes Metal JIT to fail
- The model SILENTLY falls back to CPU for all operations
- ALL previous benchmarks (2.4 tok/s) were measuring CPU, not Metal GPU
- **After inlining turbo-wht.h:**
  - MoE: 2.4 → 10.7 tok/s gen (4.5× improvement, now on Metal)
  - Qwopus: 2.4 → 5.3 tok/s gen (2.2× improvement)
  - Prompt speeds: 4× → 60-67 tok/s (MoE), 3.5 → 30 tok/s (Qwopus)
- Remaining gap vs q8_0: 8× (MoE), 3.3× (Qwopus) — down from fake 35×
- **Rule: NEVER use #include in ggml-metal.metal — always inline**
- This is the same bug we hit with turbo-matrices.h earlier and forgot to apply to WHT

### Real benchmarks (proper Metal GPU):

| Model | Cache | Prompt | Gen | Compression | vs q8_0 |
|-------|-------|--------|-----|-------------|---------|
| MoE | q8_0 | 222.8 | 85.5 | 2.0× | 1.00× |
| MoE | turbo3 | 67.3 | 10.7 | 4.9× | 0.13× |
| Qwopus | q8_0 | 83.1 | 17.6 | 2.0× | 1.00× |
| Qwopus | turbo3 | 29.8 | 5.3 | 4.9× | 0.30× |

### Remaining optimization targets (ACTUAL, not fake):
- 8× gap on MoE gen (85.5 vs 10.7)
- 3.3× gap on Qwopus gen (17.6 vs 5.3)
- WHT rotation is O(d log d) per block — already optimized
- The per-chunk redundant call pattern still wastes ~8× on dequant
- BUT: no-op dequant test was on CPU, need to redo on Metal

### 2026-03-25: Upstream research findings
- TurboQuant feature request exists upstream: ggml-org/llama.cpp#20977
- mudler has experimental fork — should compare notes
- New quant type contribution guidelines require perplexity + KL divergence data
- The CPU fallback pattern is known: MXFP4 hit the same issue (PR #20609)
- Metal Tensor API optimization (PR #20962) could give us another ~26% on mul_mat
- Our current 8× gap on MoE / 3.3× on Dense is REAL Metal overhead, not a bug
- The WHT rotation cost per block is the genuine bottleneck now

### Summary of investigation
| Test | Gen tok/s | What we learned |
|------|-----------|-----------------|
| q8_0 baseline | 85.5 | Target speed |
| Dense matvec rotation | 2.4 | Was actually CPU fallback |
| WHT rotation | 2.4 | Still CPU fallback |
| simd_broadcast opt | 2.4 | Still CPU fallback |
| threadgroup opt | 2.4 | Still CPU fallback |
| No-op dequant | 2.4 | Still CPU fallback |
| No-op quantize | 2.4 | Still CPU fallback |
| **Inlined WHT (proper Metal)** | **10.7** | **THE FIX — #include caused CPU fallback** |

Rule: NEVER use #include in ggml-metal.metal. Always inline.

### 2026-03-25: Upstream competitive intelligence — GAME CHANGERS

**Finding 1: QJL residual kills quality when naively re-added**
Dejan.ai found cosine similarity drops to 0.69 with naive QJL add-back.
Must use MSE-only for drop-in replacement, or fused kernel for QJL.

**Finding 2: Pre-rotate queries, not keys (THE BIG WIN)**
`<q, R^T * c[idx]> = <R*q, c[idx]>`
Rotate query ONCE, then dequant is just a centroid table lookup.
No rotation in the hot dequant path. Would make turbo3 as fast as q4_0.

**Finding 3: WHT abandoned by everyone**
Both Dejan.ai and mudler tried WHT, both switched to dense QR.
But with pre-rotate-queries, rotation method doesn't matter.

**Finding 4: unixsysdev gets -0.8% speed loss (CUDA, block=32)**
Pure polar coordinate encoding, no random rotation, fused kernel.

**Finding 5: We're the only Metal implementation**
Everyone else is CPU-only or CUDA. Our Metal kernels are unique.

### Immediate action items:
- [ ] Test MSE-only (no QJL) — simplify dequant, may improve quality + speed
- [ ] Implement pre-rotate-queries — eliminates rotation from dequant entirely
- [ ] Consider block size 32 for better GPU parallelism

### 2026-03-25: Speed ceiling test — no rotation in dequant (proper Metal)
- Removed turbo_rotate_inverse from dequant (quality broken but measures ceiling)
- **Result: 49.1 tok/s gen (4.6× improvement over 10.7)**
- Prompt: 162.6 tok/s (2.4× improvement over 67.3)
- This is 57% of q8_0 speed (85.5 tok/s)
- Confirms: removing rotation from dequant via pre-rotate-queries
  would get us from 10.7 → ~49 tok/s
- Remaining gap (49 vs 85) is from block size 128 + QJL overhead
- REVERTED change after measurement

### 2026-03-25: PRE-ROTATE-QUERIES — THE WIN 🎉
- Implemented: store R^T in KV cache, apply ggml_mul_mat(R_T, q) in build_attn_mha
- Stripped turbo_rotate_inverse from Metal dequant
- Codex caught: buffer_clear zeroing rotation after init (fixed)
- **MoE gen: 10.7 → 51.4 tok/s (4.8× speedup, 60% of q8_0)**
- **Qwopus gen: 5.3 → 14.6 tok/s (2.8× speedup, 83% of q8_0)**
- Prompt: 67 → 160 tok/s (MoE), 30 → 68 tok/s (Qwopus)

### FINAL SUMMARY

| Stage | MoE tok/s | Qwopus tok/s | What happened |
|-------|-----------|-------------|---------------|
| First benchmark | 2.4 | 1.3 | CPU fallback (#include bug) |
| Fixed #include | 10.7 | 5.3 | Real Metal, WHT rotation overhead |
| Speed ceiling test | 49.1 | — | Confirmed rotation is the bottleneck |
| **Pre-rotate-queries** | **51.4** | **14.6** | **Rotation moved from dequant to Q** |
| q8_0 baseline | 85.5 | 17.6 | Target |

**Total improvement: 21× from first benchmark to final (2.4 → 51.4 tok/s)**
**Compression: 4.9× throughout**

### Key lessons learned
1. NEVER use #include in ggml-metal.metal — causes silent CPU fallback
2. Always verify Metal library loads successfully before benchmarking
3. The no-op test (set function to return zeros) is the fastest way to isolate bottlenecks
4. Pre-rotate-queries (from Dejan.ai) is the right architectural approach
5. Codex + roast reviews catch real bugs (buffer clear ordering, stale code, MSL limitations)
6. The ggml_mul_mat approach for Q rotation is clean and correct

### 2026-03-25: CHANGE 1 — Drop QJL (MSE-only) ✅
- Python validation: cosine 0.9508 → 0.9831, better on 99.3% of real KV vectors
- MoE gen: 51.4 → 62.2 tok/s (73% of q8_0)
- Qwopus gen: 14.6 → 15.5 tok/s (88% of q8_0)
- Qwopus prompt: 83.1 tok/s = 100% of q8_0!

### 2026-03-25: CHANGE 2 — WHT for Q rotation — NOT NEEDED
- Speed ceiling test: removing Q rotation entirely gives 61.3 tok/s (vs 62.2 with it)
- The dense 128×128 ggml_mul_mat adds <1% overhead — negligible on Metal
- The remaining gap (62.2 vs 85.5 on MoE) is structural:
  - Block size 128 (4 blocks per head) vs q8_0 block 32 (4 blocks per head too, but simpler dequant)
  - 128 centroid lookups (3-bit unpack + table) vs 32 byte * scale (q8_0)
  - This is the irreducible cost of the algorithm
  
### FINAL PERFORMANCE

| Model | q8_0 | turbo3 MSE-only | Speed | Compression |
|-------|------|-----------------|-------|-------------|
| MoE 35B | 85.5 tok/s | **62.2 tok/s** | **73%** | **4.9×** |
| Qwopus 27B | 17.6 tok/s | **15.5 tok/s** | **88%** | **4.9×** |

### 2026-03-25: Block size diagnostic — THE ANSWER

| Cache | Block | Gen tok/s | vs q8_0 |
|-------|-------|-----------|---------|
| q8_0 | 32 | 84.1 | 1.00× |
| q4_0 | 32 | 84.2 | 1.00× |
| q4_1 | 32 | 84.6 | 1.01× |
| q5_0 | 32 | 80.6 | 0.96× |
| turbo3 | 128 | 62.5 | 0.74× |
| turbo4 | 128 | 42.8 | 0.51× |

CONCLUSION: q4_0 (block 32, 4-bit) runs at 100% of q8_0 speed.
The 26% gap on turbo3 is ENTIRELY from block size 128.
The quantization math (centroid lookup) adds zero overhead at block 32.

Next: implement block size 32 variant of turbo3.
Challenge: rotation operates on head_dim=128, but blocks are 32.
Solution: quantize 128 elements with rotation, store as 4×32 blocks.
Dequant reads 32-element blocks without rotation (pre-rotate-queries).

### 2026-03-25: BLOCK SIZE 32 — TARGET EXCEEDED 🎉

| Model | q8_0 | turbo3 block32 | vs q8_0 | Compression |
|-------|------|----------------|---------|-------------|
| MoE prompt | 222.8 | **218.5** | **98%** | 4.6× |
| MoE gen | 85.5 | **77.7** | **91%** | 4.6× |
| Qwopus prompt | 83.1 | **89.5** | **108%** | 4.6× |
| Qwopus gen | 17.6 | **17.0** | **97%** | 4.6× |

Target was 75+ tok/s on MoE. Hit 77.7. Qwopus prompt is FASTER than q8_0.

Codex post-commit review found 3 bugs (filed #29):
- TURBO_D was QK_TURBO3 (broke turbo4 C code) — fixed
- SET_ROWS kernel turbo3-specific but instantiated for turbo4 — tracked
- Tail block drop for non-128 head dims — tracked

### COMPLETE JOURNEY

| Optimization | MoE gen | vs q8_0 |
|-------------|---------|---------|
| CPU fallback (bug) | 2.4 | 3% |
| Real Metal | 10.7 | 13% |
| Pre-rotate-queries | 51.4 | 60% |
| MSE-only (drop QJL) | 62.2 | 73% |
| **Block size 32** | **77.7** | **91%** |
| q8_0 baseline | 85.5 | 100% |

**32× total improvement. 4.6× compression at 91-97% speed.**

## Pre-Rotate-Queries Investigation (2026-03-25)

Attempted to move WHT inverse rotation from dequant to graph-level ops. Goal: eliminate per-block O(128) rotation, reclaiming speed from ~10.7 to ~77 tok/s.

**Result: FAILED.** Graph-side rotation gives PPL 23.5 vs dequant-side 6.19. Root cause unknown despite extensive debugging. See `docs/pre-rotate-queries-investigation.md` for full test matrix.

**Key findings:**
- ggml_mul_mat(A, x) with row-major stored A gives A @ x (verified with 2x2 test)
- Both Q rotation and V inverse work mechanically (verified via isolation tests)
- Rotation matrices verified correct (R^T @ R = I, matches Metal WHT signs)
- Model head_dim = 256 with 128-element rotation groups (2 groups per head)
- Codex caught: `clear(true)` zeros rotation tensors without reinit (fixed)
- Codex caught: header comments had wrong matrix orientation (fixed)

**Previous state:** Dequant inverse rotation restored. PPL = 6.194. Speed = ~10.7 tok/s.

**Root cause found (Gemini 3 Pro, later disproven):** Initially thought WHT/RoPE don't commute. Actual fix: corrected ggml column-major storage orientation for rotation matrices, then applied WHT in build_attn after RoPE.

## Speed Optimization Session (2026-03-25) — 739 → 2747 tok/s

### Step 1: FP16 WHT (739 → 1074, +45%)
Switched WHT butterfly from fp32 to fp16. Centroids fit in fp16 (max |val| = 0.19).

### Step 2: Half4 Vectorized Butterfly (1074 → 1411, +31%)
Rewrote WHT to operate on 32 x half4 vectors. Stages h=1,2 use intra-vector swizzle. Stages h=4+ use inter-vector exchange with computed stride.

### Step 3: Graph-Side WHT Rotation (1411 → 2095, +48%)
Moved WHT from per-block dequant to graph-level ggml_mul_mat. Q rotation in build_attn (after RoPE, before build_attn_mha). V un-rotation after build_attn_mha. Dequant becomes simple centroid lookup.

### Step 4: Block-32 Storage (2095 → 2747, +31%)
With WHT removed from dequant, reduced block size from 128 to 32. Matches q4_0 GPU parallelism. nl drops from 8 to 2 (non-vec) and 32 to 8 (vec).

### Final Results (Qwen3.5-35B-A3B, wikitext-2, 32 chunks)

| Config | Prefill tok/s | vs q8_0 | PPL |
|--------|-------------|---------|-----|
| q8_0 baseline | 2694 | 1.00x | 5.414 |
| **turbo3 (block-32 + graph WHT)** | **2747** | **1.02x** | **5.460** |

**q8_0 speed parity. 4.6x compression. 0.8% quality loss. 3.72x total speedup.**