# Chapter 05 — Switching Backends (WebGPU / CUDA) Hesper compiles the same `ShaderM` source to two backends: - **WebGPU** via Google Dawn (Metal on macOS, Vulkan on Linux, D3D12 on Windows). - **CUDA PTX** via the libcuda driver API. You pick the backend at compile time. Most code stays backend-agnostic through the `GPUBackend` typeclass. ## The `GPUBackend` typeclass ```lean import Hesper.Backend open Hesper -- The typeclass is parameterised by a context type β (the "device"). -- Every backend supplies a Buf type and the operations we need: #check @GPUBackend -- @GPUBackend : (β : Type) → Type 1 -- Some of the methods that backends must implement: #check @GPUBackend.allocBuffer -- allocBuffer : ∀ {β} [GPUBackend β], β → USize → IO _ #check @GPUBackend.executeWithConfig #check @GPUBackend.readBuffer ``` The full interface is in `Hesper/Backend.lean`. Two concrete instances ship today: `Hesper.Backend.WebGPU` and `Hesper.Backend.CUDA`. They share the same `ShaderM` input — the kernels you wrote in Ch02 work unchanged on both. ## Choosing at build time with `-Kgpu` The lakefile exposes a `gpu` configuration flag: ```bash lake -Kgpu=cuda build my-app # CUDA exe; -lcuda is added lake -Kgpu=cpu build my-app # CPU-only (no CUDA, no -lcuda) lake -Kgpu=auto build my-app # default: probe; CUDA if available ``` `auto` means: if `cmake` is present *and* (`libcuda.so` or `nvcc`) exists, configure for CUDA; otherwise fall back to CPU/WebGPU paths. The flag controls *only* whether `-lcuda` (the NVIDIA driver shim) links into the final executable. Dawn / WebGPU is always built — it's how every shader-target backend works on macOS, Linux, and Windows. ## What changes between backends | Aspect | WebGPU | CUDA PTX | |---|---|---| | Surface language | WGSL | PTX 8.7 (NVIDIA) | | Tensor cores | via `subgroupMatrix*` → WMMA (limited) | `wmma.mma.sync` directly | | Async memcpy | command-buffer batching | `cuMemcpyHtoD_v2` + streams | | Capture / graphs | n/a in Dawn | CUDA Graphs | | Driver | Dawn / Metal / Vulkan / D3D12 | `libcuda.so.1` | For most layers the difference is invisible; for the perf-critical kernels (Q4_K matmul, flash attention, RMSNorm) the CUDA backend generates code tuned for NVIDIA's dp4a, `cp.async`, and `wmma` instructions, which WGSL doesn't expose. ## The `gemma4-cuda` entry point The Gemma 4 model is wired to CUDA only — it depends on `wmma.mma.sync` and `dp4a`. Build and run it like any CUDA executable: ```bash lake -Kgpu=cuda build gemma4-cuda HESPER_CHAT=1 \ ./.lake/build/bin/gemma4-cuda data/gemma-4-e4b-it-Q4_K_M.gguf "Hello" 30 ``` Ch08 walks through everything this entry point does end-to-end. ## What's next - [Chapter 06 — Proofs](Ch06_Proofs.md): what we mean by "verified", with the equivalence layer. - [Chapter 08 — Gemma 4 End-to-End](Ch08_Gemma4.md): the CUDA path applied to a 4 B-parameter LLM.