# Chapter 08 — Gemma 4 End-to-End

Gemma 4 is Google's open-weights LLM family. Hesper supports the
**E4B (efficient 4 B-parameter)** instruction-tuned variant via the
CUDA backend. This chapter shows how to load a GGUF file, run greedy
or chat-template decoding, and read the performance counters.

## Prerequisites

- NVIDIA GPU with compute capability ≥ 8.0 (`sm_80`+).
- Driver supporting CUDA Toolkit 12.x.
- A Q4_K_M or Q6_K GGUF file of Gemma 4 E4B (e.g. from Hugging Face).

## Hello, Gemma

```bash
lake -Kgpu=cuda build gemma4-cuda
HESPER_CHAT=1 \
  ./.lake/build/bin/gemma4-cuda data/gemma-4-e4b-it-Q4_K_M.gguf "Hello" 30
```

Expected output (greedy decode, chat template enabled):

```
Hello! How can I help you today? 😊
```

Without `HESPER_CHAT=1`, the model sees the raw prompt and produces
its base-model continuation — useful for sanity-checking the kernel
path but not for chat-style use.

## What runs under the hood

`gemma4-cuda` does this on every token:

1. **Embed.** Look up the token in the embedding table.
2. **Per-layer PLE.** Gemma 4 uses Per-Layer Embeddings — a small extra
   table fetched on demand from CPU mmap to keep VRAM small.
3. **42 transformer blocks.** Each block:
   - Fused RMSNorm + Q8_1 quantize.
   - QKV projection (Q4_K matmul, MMQ5 tile shape for prefill, dp4a
     4-warp for decode).
   - RoPE-K + KV scatter fused.
   - Flash attention V11 (sub-warp partition, K-parallel, split-K).
   - Output projection (Q4_K, 4-warp coop-K).
   - Post-attention RMSNorm + residual fused.
   - Gate / up / GELU-quick (Q4_K matmul, ncols_dst=2).
   - FFN-down (Q6_K matmul, 4-warp 1-row).
4. **LM head.** Q6_K matmul pre-dequantized to F16 for the 256 K
   vocabulary.
5. **Argmax on device.** No DtoH bubble per token.

The whole sequence runs inside a single **CUDA Graph** (default ON), so
host overhead per token is one `cuGraphLaunch` call.

## Performance characteristics

On an RTX 4070 Ti:

| Workload | TPS | Notes |
|---|---|---|
| Decode (32-token prompt) | ~100 | CUDA Graphs ON, MMQ default for prefill |
| Prefill (seqLen 70) | ~17 ms | MMQ5 (llama.cpp-shape tile, mmq_y=128, mmq_x=64) |
| Cold start | +1.4 s | PTX JIT — cubin cache eliminates this on repeats |

The kernel times themselves are within ~3 % of llama.cpp's CUDA backend
(measured separately with `nsys`). The remaining wall-clock gap is host
overlap, not raw matmul throughput.

## Useful environment knobs

```bash
HESPER_CHAT=1                  # apply the IT chat template
HESPER_DP4A=1                  # force dp4a path for decode
HESPER_PREFILL_MMQ2_OFF=1      # disable MMQ for prefill (use dp4a)
HESPER_CUDA_GRAPHS=0           # disable CUDA Graph capture
HESPER_PIN_MMAP=1              # cuMemHostRegister the GGUF mmap region
HESPER_USE_MMAP=1              # mmap the GGUF instead of fread
```

The `HESPER_*` flags are documented in `Hesper/Models/Gemma4/Config.lean`.

## Reading the source

- `Hesper/Models/Gemma4/Gemma4.lean` — top-level forward and decode loop.
- `Hesper/Models/Gemma4/Linear.lean` — Q4_K / Q6_K matmul dispatchers.
- `Hesper/Layers/FlashAttention.lean` — the V11 vec kernel.
- `Hesper/Layers/RMSNorm.lean` — fused RMSNorm + Q8_1 quantize.
- `Hesper/IO/GGUF.lean` — GGUF loader; mmap + on-demand H2D for PLE.
- `Examples/Gemma4/Main.lean` — the `gemma4-cuda` driver.

## Parity test suite

26 parity tests verify each Gemma 4 component against llama.cpp's CPU
reference output, byte-for-byte:

```bash
lake -Kgpu=cuda build gemma4-qproj-parity
lake -Kgpu=cuda build gemma4-ffn-parity
lake -Kgpu=cuda build gemma4-kv-parity
# etc.
```

These tests use `scripts/llama_parity/` to dump ggml-CPU output to a
file and compare it byte-by-byte with the hesper GPU output. They are
how we caught the ShaderM `if_` branch CSE leak and several quantization
off-by-ones during bring-up.

## What's next

- [Chapter 09 — Embedding Hesper in Other Projects](Ch09_Embedding.md):
  add Hesper as a dependency in your own package.
- [Chapter 10 — Architecture](Ch10_Architecture.md): how the pieces in
  this chapter actually fit together, with diagrams.