# Chapter 07 — BitNet b1.58 End-to-End This chapter walks through Hesper's first complete inference engine: BitNet b1.58 2B. Highlights: - **125 TPS on M4 Max** via WebGPU/Metal. - **40 TPS on RTX 4070 Ti** via WebGPU/Vulkan. - LoRA-style instruction fine-tuning with a verified backward pass. For deeper LoRA training details see [`docs/LORA_FINETUNING.md`](../../LORA_FINETUNING.md). ## Run inference ```bash lake exe bitnet-complete --stats ``` Expected output: ```text > Hello, world! Hello, world! I'm a 20-year-old college student... Performance: 125.6 TPS (8.0 ms/token) Model: BitNet b1.58 2B (30 layers, 2560 dim, i2_s ternary weights) ``` ## What's interesting about BitNet BitNet b1.58 quantises every weight to `{-1, 0, +1}` (1.58 bits per weight). The forward pass becomes additions only — no fp multiplies in the matmul path. The challenge for a GPU implementation is squeezing useful throughput out of an op that's normally memory-bound by the weight reads. Hesper's recipe: | Optimisation | What it does | Source | |---|---|---| | **i2_s ternary kernel** | Pack 4 weights into 8 bits; matmul = popcount-style add | `Hesper/Models/BitNet.lean` | | **Flash attention** | Fused score + online softmax + apply in one kernel | `Hesper/Layers/FlashAttention.lean` | | **Fused gate+up+ReLU²×mul** | One FFN dispatch instead of four | `Hesper/Layers/*` | | **Fused KV cache write** | Score + scatter into one kernel | `Hesper/Layers/Attention.lean` | | **F16 LM-head matmul** | Shared-memory tile across the 128 K vocab | `Hesper/Models/BitNet.lean` | | **PreparedDispatch capture** | 99 % pipeline-cache hit rate | `Hesper/Compute.lean` | | **Single GPU submit/token** | Command-buffer batching | `Examples/BitNetComplete.lean` | | **KV cache + GQA** | 20 heads / 5 KV heads | `Hesper/Models/BitNet.lean` | ## Sketch of the inference loop The driver in `Examples/BitNetComplete.lean` wires this up end-to-end. Schematically: ```text let dev ← Hesper.Device.create let model ← BitNet.load dev "data/bitnet-1.58-2b.bin" let tok ← BitNet.Tokenizer.load "data/bitnet-tokenizer.json" let prompt := "Hello, world!" let mut state := BitNet.State.init model let mut tokens := tok.encode prompt for _ in [0:64] do let logits ← BitNet.forward model state tokens.back! let next := BitNet.argmax logits tokens := tokens.push next state := BitNet.advance state IO.print (tok.decode #[next]) IO.println "" ``` (The exact function names live in `Hesper/Models/BitNet.lean` and the `Examples/BitNetComplete.lean` driver — they're plumbing-heavy and not worth reproducing verbatim in a tutorial.) `BitNet.forward` is where every fused kernel actually runs. Internally it does, per token: 1. Embed the token. 2. For each of the 30 transformer layers: - RMSNorm (fused with quant-pack of the input). - QKV projection (i2_s matmul, 4-warp coop K). - RoPE-Q in place, RoPE-K + KV scatter fused. - Flash attention (vec-kernel, K-parallel, sub-warp partition). - Output projection (i2_s). - Residual + post-attention RMSNorm fused. - Gate / up / ReLU² × mul / down (one fused kernel). 3. LM head: F16 shared-memory matmul into the 128 K vocabulary. 4. Argmax → next token. ## LoRA fine-tuning ```bash lake exe lora-train data/alpaca.jsonl ``` Trains a low-rank adapter on Alpaca-style data, using the verified-AD layer from Ch03. The training loop uses the same kernels as inference plus a single backward pass per fused op (see Ch06). LoRA weights save out as a small adapter file you can swap into the inference binary at load time. ## Reading the source Start here: - `Hesper/Models/BitNet.lean` — the top-level transformer and per-layer kernels. - `Hesper/Layers/FlashAttention.lean` — flash attention shared with Gemma 4. - `Examples/BitNetComplete.lean` — the inference driver wired up to `bitnet-complete`. - `Examples/MachineLearning/` — LoRA training drivers. ## What's next - [Chapter 08 — Gemma 4 End-to-End](Ch08_Gemma4.md): a larger transformer on the CUDA backend with quantised weights (Q4_K_M / Q6_K).