# Benchmarks

All measurements are on the indicated arch with the engine's default config (asym3 KV, FlashAttention auto, `prompt_normalize=on`). Numbers are medians across 5 runs unless noted. See [methodology/perf-benchmarking.md](methodology/perf-benchmarking.md) for the protocol and the noise band you should expect when reproducing.

## Autoregressive decode (no spec) — 7900 XTX (gfx1100)

| Model | decode | prefill (peak) | effective BW |
|---|---:|---:|---:|
| Qwen 3.5 0.8B MQ4 | **391 tok/s** | **7383 tok/s** | 200 GiB/s |
| Qwen 3.5 4B MQ4 | **180 tok/s** | **2487 tok/s** | 433 GiB/s |
| Qwen 3.5 9B MQ4 | **132 tok/s** | **1663 tok/s** | **654 GiB/s** |
| Qwen 3.5 27B MQ4 | **47 tok/s** | **478 tok/s** | **651 GiB/s** |

9B and 27B decode saturate ~650 GiB/s of the 7900 XTX's 960 GB/s peak (68% BW-efficient end-to-end across weights + KV + activations). Prefill on the smaller sizes is WMMA-bound on the MQ4 fused projections.

## DFlash speculative decode by genre — 7900 XTX

DFlash speedup is **genre-conditional**. Code prompts, whose target distribution matches the draft, win big; long-form prose, where the target's high-entropy continuations diverge from draft predictions, can be a net loss. 5-run medians, asym3 KV, `--no-chatml`, `max_tokens=120`, `prompt_normalize=true`:

| Model | genre | AR tok/s | DFlash tok/s | speedup | τ |
|---|---|---:|---:|---:|---:|
| Qwen 3.5 27B | code (HumanEval/53) | 44.1 | **196.0** (peak 218.6) | **4.45×** | 9.82 |
| Qwen 3.5 27B | prose (Rome essay) | 44.0 | 49.6 | 1.13× | 1.67 |
| Qwen 3.5 27B | instruct (sky-color) | 44.6 | 44.7 | 1.00× | 1.39 |
| Qwen 3.5 9B | code (HumanEval/53) | 124.0 | **329.1** (peak 346.7) | **2.65×** | 6.76 |
| Qwen 3.5 9B | code (HumanEval/0) | 121.9 | **372.9** | **3.06×** | 8.23 |
| Qwen 3.5 9B | instruct (sky-color) | 124.4 | **246.9** | **1.99×** | 4.76 |
| Qwen 3.5 9B | prose (federalist) | **125.3** | 99.4 | 0.79× ✗ | 1.20 |
| Qwen 3.5 9B | prose (Rome) | **122.7** | 98.3 | 0.80× ✗ | 1.20 |
| Qwen 3.6 27B | code (HumanEval/53) | 44.2 | **185.5** | **4.19×** | 9.25 |

**Default `dflash_mode=off`** as of v0.1.8 — DFlash is opt-in until the genre-conditional speedup is a win across a wider range of workloads. Enable it globally with `hipfire config set dflash_mode auto` (the engine then turns DFlash on for dense Qwen 3.5+ targets and off where it historically loses), or per model with `hipfire config qwen3.5:27b set dflash_mode on`. The numbers above were measured with DFlash forced on.

## vs ollama (Q4_K_M GGUF) — 7900 XTX

Same machine, same models: hipfire MQ4 (asym3 KV, FlashAttention) vs ollama's default Q4_K_M through llama.cpp's ROCm backend, with matched ~140-token and ~530-token prompts and matched 128-token generation lengths. Ollama's numbers are taken from its own `prompt_eval_duration` / `eval_duration` reporting via `/api/generate` with `num_predict=128`. All numbers are tok/s.

| Model | hf pp128 | oll pp128 | hf pp512 | oll pp512 | hf decode | oll decode | decode× |
|---|---:|---:|---:|---:|---:|---:|---:|
| Qwen 3.5 0.8B | **10,861** | 4,622 | **12,962** | 7,117 | **353** | 168 | **2.10×** |
| Qwen 3.5 4B | **3,304** | 1,972 | **3,321** | 2,670 | **165** | 93 | **1.78×** |
| Qwen 3.5 9B | **1,920** | 1,428 | 1,919 | **1,970** | **122** | 71 | **1.71×** |
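The ollama columns come straight out of the JSON that `/api/generate` returns when `stream` is false. A minimal sketch of that extraction (the endpoint, `num_predict`, and the `*_eval_count` / `*_duration` fields are ollama's documented API; the model tag and prompt below are placeholders, not the exact inputs used above):

```ts
// Sketch: derive ollama's prefill/decode tok/s from /api/generate timings.
// The *_duration fields are reported in nanoseconds.
const res = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  body: JSON.stringify({
    model: "qwen2.5:7b",           // placeholder tag
    prompt: "...",                 // the matched ~140- or ~530-token prompt
    stream: false,
    options: { num_predict: 128 }, // matched generation length
  }),
});
const r = await res.json();
const prefillTokS = r.prompt_eval_count / (r.prompt_eval_duration / 1e9);
const decodeTokS = r.eval_count / (r.eval_duration / 1e9);
console.log({ prefillTokS, decodeTokS });
```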
hipfire wins decode 1.7–2.1× across the board — that's the user-visible number for interactive chat. Prefill is more nuanced: hipfire wins decisively on 0.8B / 4B and at pp128 for 9B (batched MQ4 fused projections saturate WMMA on small matmuls where llama.cpp's per-token GGUF dequant can't), but ollama edges past at pp512 for 9B (at that size the GEMMs are large enough to saturate even without WMMA). Harness: [`cli/bench_vs_ollama.ts`](../cli/bench_vs_ollama.ts).

## Other arches

Decode tok/s, default config:

| Arch | Examples | 0.8B | 4B | 9B | 27B |
|---|---|---:|---:|---:|---:|
| RDNA2 (gfx1030) | V620 Pro, RX 6800 XT | 250 | — | 65 | 22 |
| RDNA1 (gfx1010) | RX 5700 XT | 190 | 61 | 43 (HF4) | OOM |
| APU (gfx1013) | BC-250 | 207 | 77 | 47 | OOM |
| MI300X (gfx942) | datacenter | 850 | 480 | 320 | 130 |

MI300X is wave64 + MFMA — a different kernel family. RDNA4 (gfx1200 / gfx1201) ships a dispatch fallback to dot2 today; per-arch WMMA kernels are in progress (issue #54).

## Reproducing

```bash
hipfire bench qwen3.5:9b
```

Runs the canonical bench (pp32 / pp128 / decode) on a fresh build against the committed speed baselines in `tests/speed-baselines/*.txt`. The same harness gates pre-commit when kernel or dispatch code changes.

For DFlash perf comparison, use the prompt-md5-pinned scripts in `benchmarks/prompts/` — see `methodology/perf-benchmarking.md` for why prompt structure matters as much as model + flags (one stray newline swings τ by 17%).
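If you script your own DFlash comparisons, pin the prompt bytes the same way. A minimal sketch of the guard, assuming an `.md5` sidecar file next to each prompt (the filename and sidecar convention here are illustrative, not the committed scripts' actual layout):

```ts
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

// Refuse to benchmark if the prompt bytes drifted: one stray newline
// changes tokenization, and with it τ and the measured speedup.
const promptPath = "benchmarks/prompts/code_humaneval_53.txt";   // illustrative name
const pinned = readFileSync(`${promptPath}.md5`, "utf8").trim(); // illustrative sidecar
const actual = createHash("md5").update(readFileSync(promptPath)).digest("hex");
if (actual !== pinned) {
  throw new Error(`prompt drifted: ${promptPath} md5 ${actual} != pinned ${pinned}`);
}
```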