# Chapter 10 — Hesper Architecture

This chapter is the wide-angle view. We've gone from the DSL (Ch02) to real models (Ch07, Ch08); now let's see how the layers stack and where to look when something goes wrong.

## The five layers

The stack has five layers, L1 through L5, sitting on top of the backends at L0:

```
┌────────────────────────────────────────────────────────────────┐
│  L5  Models                                                    │
│      Hesper.Models.BitNet, Hesper.Models.Gemma4                │
│      end-to-end inference + training loops                     │
├────────────────────────────────────────────────────────────────┤
│  L4  High-level API                                            │
│      Hesper.Layers.*, Hesper.AD, Hesper.Tensor                 │
│      composable layers + reverse-mode autodiff + tensors       │
├────────────────────────────────────────────────────────────────┤
│  L3  Circuit DSL (kernel fusion)                               │
│      Hesper.Circuit                                            │
│      Prim.pointwise / Prim.reduce / Prim.matmul / Prim.scatter │
│      + fusion passes (mergeSameDispatch, fusePointwise, …)     │
├────────────────────────────────────────────────────────────────┤
│  L2  ShaderM IR                                                │
│      Hesper.WGSL.Monad                                         │
│      imperative kernel construction: buffers, loops, barriers  │
├────────────────────────────────────────────────────────────────┤
│  L1  Exp (typed shader expressions)                            │
│      Hesper.WGSL.{Types, Exp, DSL}                             │
│      type-safe arithmetic, math, vector, matrix                │
├────────────────────────────────────────────────────────────────┤
│  L0  Backends                                                  │
│      Hesper.WGSL.CodeGen   (→ WGSL → Dawn → Metal/Vulkan/D3D12)│
│      Hesper.CUDA.CodeGen   (→ PTX → libcuda → NVIDIA GPU)      │
└────────────────────────────────────────────────────────────────┘
```

Each layer serves as the public interface that the layer above it builds on. You can drop down to a lower layer when you need control — model code happily mixes L4 layers with hand-written L2 kernels.

## The IR triangle

Two intermediate representations exist:

- **`ShaderM`** (L2) is the *imperative* IR. A `ShaderM Unit` is a block of statements over typed buffers. It's the natural form for hand-writing a kernel.
- **`Circuit`** (L3) is the *graph* IR.
  A `Circuit` is a DAG of `Prim` nodes; each node has known forward semantics, and fusion passes rewrite the graph before it is lowered to ShaderM.

Both lower to the same backend printers (L0). The Circuit DSL lets the compiler fuse safely; ShaderM lets you write exactly what you want.

## The fusion story

The decode path used to issue ~200 kernel dispatches per token. After fusion it issues ~25, all captured inside one CUDA Graph. The wins came from chaining the following passes:

| Pass | What it merges |
|---|---|
| `fusePointwise` | Adjacent pointwise (elementwise) ops |
| `fuseReduceIntoQuantize` | RMSNorm → Q8_1 quantize |
| `fuseMatmulEpilogue` | matmul + pleScale / postNormAdd / activation |
| `mergeSameDispatch` | Identical dispatches scheduled together |
| `fuseScatter` | RoPE-K → KV cache write |
| `fuseWriteDestination` | scatter + view + writeSlice |

Read the passes in `Hesper/Circuit/Passes/*.lean`. Each is under 200 lines and proves that its rewrite preserves semantics.

## The backend split

```
ShaderM Unit
   │
   ├──► WGSL printer ──► WGSL string ──► Dawn ──► driver
   │                                      │
   │                                      ├── Metal  (macOS)
   │                                      ├── Vulkan (Linux + Win)
   │                                      └── D3D12  (Windows)
   │
   └──► PTX printer  ──► PTX string  ──► libcuda
                                          │
                                          └── NVIDIA driver → SASS
```

The two backends share `Exp`, the type system, and most of `ShaderM`. The differences are concentrated in:

- `Hesper/WGSL/CodeGen.lean` vs `Hesper/CUDA/CodeGen.lean` — the printers.
- A few backend-specific `Exp` constructors (`dot4I8Packed`, `subgroupMatrix*`, `cp.async`) that are no-ops on the side that doesn't support them.
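
The idea of one shared expression tree feeding two printers can be illustrated with a toy model. This is a minimal sketch, not Hesper's code: `ToyExp`, `toWGSL`, `toCudaC`, and `rmsScale` are hypothetical names, and the second printer emits CUDA C-style text as a stand-in for the PTX that the real CUDA backend produces.

```lean
/-- Toy expression AST (hypothetical; far smaller than Hesper's `Exp`). -/
inductive ToyExp where
  | var   : String → ToyExp
  | add   : ToyExp → ToyExp → ToyExp
  | mul   : ToyExp → ToyExp → ToyExp
  | rsqrt : ToyExp → ToyExp

/-- Printer 1: WGSL-style text. `inverseSqrt` is the WGSL builtin name. -/
def toWGSL : ToyExp → String
  | .var n   => n
  | .add a b => "(" ++ toWGSL a ++ " + " ++ toWGSL b ++ ")"
  | .mul a b => "(" ++ toWGSL a ++ " * " ++ toWGSL b ++ ")"
  | .rsqrt a => "inverseSqrt(" ++ toWGSL a ++ ")"

/-- Printer 2: CUDA C-style text (a stand-in for PTX).
    `rsqrtf` is CUDA's single-precision reciprocal square root. -/
def toCudaC : ToyExp → String
  | .var n   => n
  | .add a b => "(" ++ toCudaC a ++ " + " ++ toCudaC b ++ ")"
  | .mul a b => "(" ++ toCudaC a ++ " * " ++ toCudaC b ++ ")"
  | .rsqrt a => "rsqrtf(" ++ toCudaC a ++ ")"

/-- One tree, built once: x * rsqrt(meanSq + eps). -/
def rmsScale : ToyExp :=
  .mul (.var "x") (.rsqrt (.add (.var "meanSq") (.var "eps")))

-- The same tree goes through both printers; only the builtin name differs.
#eval toWGSL rmsScale   -- "(x * inverseSqrt((meanSq + eps)))"
#eval toCudaC rmsScale  -- "(x * rsqrtf((meanSq + eps)))"
```

In this toy the two printers differ in a single case, which mirrors the claim above that the backend differences are concentrated in the printers while the expression type is shared.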
## Verification surfaces

| Layer | What's checked |
|---|---|
| L1 (Exp) | Type system catches scalar/vector/matrix mismatches |
| L2 (ShaderM) | Buffer-shape/array-bound constraints in types |
| L3 (Circuit) | Fusion passes carry equivalence proofs |
| L4 (AD) | Every `Differentiable` instance has a correctness theorem |
| L5 (Models) | Numerical bit-parity tests against `llama.cpp` |

The L5 tests don't replace the proofs at the layers below; they protect against issues outside the proof scope (driver bugs, JIT differences, race conditions).

## Where to start reading the code

The fastest tour of the architecture:

1. `Hesper.lean` — the top-level re-exports.
2. `Hesper/WGSL/Exp.lean` — the type-safe expression AST.
3. `Hesper/WGSL/Monad.lean` — `ShaderM` and the builder API.
4. `Hesper/Layers/RMSNorm.lean` — a small, self-contained layer.
5. `Hesper/Layers/FlashAttention.lean` — the production attention kernel.
6. `Hesper/Circuit/Lowering.lean` — Circuit → ShaderM.
7. `Hesper/Models/Gemma4/Gemma4.lean` — full model assembly.

After that, follow imports.

## Architecture decisions worth knowing

- **No hidden optimizer.** The DSL → backend path is deterministic; what you write is what runs (modulo backend-side ptxas / SPIR-V optimisation, but those are stable across versions).
- **Fusion is opt-in by *not* writing the unfused chain.** If you call `gateProj` then `relu` then `upProj`, the Circuit DSL will fuse them — and you can verify by dumping the generated WGSL/PTX.
- **No global state.** Devices and buffers are explicit; there's no implicit "current device" like PyTorch's `torch.cuda.current_device()`.
- **Lean is the build system.** Lake compiles the library, schedules native deps, runs tests, generates docs. No CMake-on-CMake.

## End of the tutorial

That's the tour. From here:

- Skim `Hesper/Models/Gemma4/Gemma4.lean` — it uses every layer above.
- Try modifying `Hesper/Layers/FlashAttention.lean` and re-running `lake exe gemma4-cuda` to see how the change propagates.
- Read [`docs/research/`](../../research/) if you want the messy history behind each kernel's current shape — every shortcut has a debug log.

Happy hacking.