---
name: cuda-kernel-refine
description: Iterative CUDA kernel optimization using NVIDIA profiling tools (nsys, ncu). Use when optimizing kernels, improving throughput, reducing bandwidth, analyzing roofline, comparing benchmarks, or investigating register pressure and occupancy.
---

# CUDA Kernel Refinement Loop

**One change per cycle:** baseline → profile → classify → optimize → verify → compare → loop. Multiple simultaneous changes make it impossible to attribute an improvement to any one of them. Revert on regression.

## 1. Establish Baseline

Discover the project's benchmark infrastructure before running anything:

1. Check `Makefile` for bench/profile targets (`make bench`, `make profile`, etc.)
2. Check for benchmark binaries (`cargo bench`, `pytest-benchmark`, a custom `e2e_bench` binary)
3. Look at CI scripts or the README for the canonical benchmark invocation
4. If no benchmark exists, write a minimal one that exercises the hot path with measurable output (see the sketch at the end of this section)

Run with enough iterations for stable results — coefficient of variation <5%.

```bash
# Save baseline (JSON or structured output preferred for diff later)
./benchmark --iterations 5 --json-output /tmp/bench-baseline.json
```

Record: primary metric (tok/s, GFLOPS, GB/s, or latency ms), stddev, and workload parameters (input size, batch size, context depth). You need all three for meaningful comparison later.
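If you end up writing that minimal benchmark yourself, a sketch along these lines works. Everything here (`my_kernel`, the element count, the iteration counts) is a hypothetical stand-in for the project's real hot path:

```cuda
// bench_min.cu — hypothetical standalone harness; adapt to the real hot path.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;  // placeholder for the real work
}

int main() {
    const int n = 1 << 24, warmup = 10, iters = 100;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    dim3 block(256), grid((n + block.x - 1) / block.x);
    for (int i = 0; i < warmup; ++i)            // exclude startup effects
        my_kernel<<<grid, block>>>(in, out, n);

    cudaEvent_t t0, t1;                         // GPU-side timing
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    for (int i = 0; i < iters; ++i)
        my_kernel<<<grid, block>>>(in, out, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms;
    cudaEventElapsedTime(&ms, t0, t1);
    double gb = 2.0 * n * sizeof(float) * iters / 1e9;  // bytes read + written
    printf("{\"avg_ms\": %.4f, \"gbps\": %.1f}\n", ms / iters, gb / (ms / 1e3));
    return 0;
}
```

Events time GPU work only; to report a stddev, time each iteration with its own event pair and aggregate.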
## 2. Profile with nsys

### Capture

```bash
nsys profile --stats=true -o /tmp/nsys-profile ./app [args]
```

### Quick look (JSON)

```bash
# Kernel timing summary — sort by Time(%) to find the hotspot
nsys stats -r cuda_gpu_kern_sum --format json /tmp/nsys-profile.nsys-rep

# CUDA API calls (spot synchronization overhead, excessive cudaMalloc)
nsys stats -r cuda_api_sum --format json /tmp/nsys-profile.nsys-rep

# Memory operations by size (spot unnecessary small D2H/H2D transfers)
nsys stats -r cuda_gpu_mem_size_sum --format json /tmp/nsys-profile.nsys-rep
```

### Deep analysis (SQLite — the real power tool)

For custom queries, export to SQLite and use SQL directly. This is far more powerful than JSON for kernel-level analysis.

```bash
# Export nsys trace to SQLite
nsys export --type sqlite /tmp/nsys-profile.nsys-rep
# Creates /tmp/nsys-profile.sqlite

# Top kernels by time with full names
sqlite3 -header -column /tmp/nsys-profile.sqlite \
  "SELECT printf('%.1f%%', 100.0*sum(k.end-k.start)/(SELECT sum(end-start) FROM CUPTI_ACTIVITY_KIND_KERNEL)) as pct,
          count(*) as calls,
          printf('%.1f', sum(k.end-k.start)/1e6) as total_ms,
          printf('%.1f', avg(k.end-k.start)/1e3) as avg_us,
          s.value as kernel
   FROM CUPTI_ACTIVITY_KIND_KERNEL k
   JOIN StringIds s ON k.shortName = s.id
   GROUP BY s.value
   ORDER BY sum(k.end-k.start) DESC LIMIT 20"

# Per-token analysis (for inference: normalize by decode tokens)
sqlite3 -header -column /tmp/nsys-profile.sqlite \
  "SELECT printf('%.0f', count(*) * 1.0 / <N_TOKENS>) as 'launches/tok',
          printf('%.1f', sum(end-start)/1e6/<N_TOKENS>) as 'ms/tok',
          printf('%.1f', avg(end-start)/1e3) as 'us/launch'
   FROM CUPTI_ACTIVITY_KIND_KERNEL"

# Kernel launch configs (grid, block, registers) — find dimension mismatches
sqlite3 -header -column /tmp/nsys-profile.sqlite \
  "SELECT s.value as kernel, k.gridX, k.gridY, k.gridZ,
          k.blockX, k.blockY, k.blockZ,
          k.registersPerThread as regs, count(*) as launches,
          printf('%.1f', avg(k.end-k.start)/1e3) as avg_us
   FROM CUPTI_ACTIVITY_KIND_KERNEL k
   JOIN StringIds s ON k.shortName = s.id
   GROUP BY s.value, k.gridX, k.gridY, k.gridZ, k.blockX, k.blockY, k.blockZ
   ORDER BY sum(k.end-k.start) DESC LIMIT 20"
```

Replace `<N_TOKENS>` with the number of tokens generated in the trace.

### Interpret the profile

- **High `Instances` + moderate `Avg`** → per-layer kernel, fusion opportunity
- **Low `Instances` + high `Avg`** → single expensive kernel, optimize its internals
- **Many small kernels** → launch overhead ("death by a thousand cuts"); count launches/token to quantify
- **High `cudaMalloc` count in the API summary** → framework allocator overhead, consider memory pools

## 3. Check Register Usage

```bash
nvcc -cubin -arch=<sm_arch> -Xptxas=-v kernel.cu -o /tmp/kernel.cubin   # <sm_arch> = your GPU, e.g. sm_90
```

Output format:

```text
ptxas info : Compiling entry function 'kernel_name' for 'sm_121a'
ptxas info : Function properties for kernel_name
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 29 registers, used 0 barriers
```

Key signals:

| Signal                 | Meaning                                   | Action                                       |
| ---------------------- | ----------------------------------------- | -------------------------------------------- |
| Spill stores/loads > 0 | Register pressure                         | `__launch_bounds__` or reduce live variables |
| Registers > 64         | Occupancy limited by registers            | Check ncu Occupancy section                  |
| Stack frame > 0        | Function calls or array-to-local demotion | Inline functions, use shared memory          |
| Barriers > 0           | Shared memory synchronization             | Expected with `__syncthreads()`              |
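When the table points at `__launch_bounds__`, the fix is a single attribute on the kernel. A minimal sketch; the kernel name, body, and bounds values (256 threads/block, 2 resident blocks/SM) are illustrative, not tuned:

```cuda
// Promise ptxas the launch config: at most 256 threads per block, and aim
// for at least 2 resident blocks per SM. ptxas budgets registers per thread
// to fit; recompile with -Xptxas=-v to confirm spill stores/loads stayed at 0.
__global__ void __launch_bounds__(256, 2)
my_hot_kernel(const float* __restrict__ in, float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];  // body elided; the attribute is the point
}
```

If the cap introduces spills rather than removing them, relax the second argument: spilled loads usually cost more than the occupancy they buy (see "Occupancy chasing" under Common Pitfalls).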
## 4. Drill Down with ncu

**Requires GPU performance counter permissions.**

```bash
# Default sections (SOL + occupancy), fast
ncu -k "kernel_regex" -c 5 ./app [args]

# Full metrics including roofline
ncu -k "kernel_regex" --set full -o /tmp/ncu-report ./app [args]

# CSV for specific metrics
ncu --csv -k "kernel_regex" --metrics \
  sm__throughput.avg.pct_of_peak_sustained_elapsed,\
dram__throughput.avg.pct_of_peak_sustained_elapsed \
  ./app [args]
```

If you get `ERR_NVGPUCTRPERM`, performance counters are restricted. Fix with:

```bash
# Option A: run ncu with sudo
sudo ncu -k "kernel_regex" -c 5 ./app [args]

# Option B: kernel module parameter (persistent, requires reboot)
sudo sh -c 'echo "options nvidia NVreg_RestrictProfilingToAdminUsers=0" > /etc/modprobe.d/ncu-permissions.conf'
```

Section sets (choose based on what you need):

- **Default** (`SpeedOfLight`, `Occupancy`) — fast, answers "is this compute- or memory-bound?"
- **`--set detailed`** — adds instruction mix, memory workload breakdown, warp scheduling. Use when the default sections show a bottleneck but don't explain why.
- **`--set full`** — adds roofline charts. 10-50x slower. Only use when you need the visual roofline or are writing up an analysis.

## 5. Classify Bottleneck

From the ncu `SpeedOfLight` section (or infer from kernel type if ncu is unavailable):

| SM SOL %    | Memory SOL % | Classification     | Optimization Direction                |
| ----------- | ------------ | ------------------ | ------------------------------------- |
| Low         | High (>60%)  | **Memory-bound**   | Reduce traffic, wider loads, fuse ops |
| High (>60%) | Low          | **Compute-bound**  | Fast math, ILP, tensor cores          |
| Low         | Low          | **Latency-bound**  | Occupancy, reduce sync, overlap       |
| High        | High         | **Well-optimized** | Algorithmic changes or accept         |

### Bandwidth utilization (memory-bound kernels)

For inference workloads, compute bandwidth utilization directly — this is more actionable than an abstract roofline:

```text
data_per_step        = weight_bytes + activation_bytes + KV_cache_bytes (read + write)
theoretical_min_time = data_per_step / hw_bandwidth
utilization          = theoretical_min_time / actual_time
```

Example: FP4 weights read 5.1 GB/token at 273 GB/s → theoretical min 18.7 ms; actual 24 ms → 78% utilization. The 22% gap is your optimization headroom.

### Without ncu

If you have no counter permissions, infer from kernel behavior:

- Elementwise ops (activation functions, normalization) → almost always memory-bound
- Reductions with few output elements (layernorm, softmax) → likely compute-bound on the reduction
- Large matmuls → compute-bound
- Many tiny kernels with low total time → latency-bound (launch overhead)

### Identify host-side bottlenecks

Not all overhead lives in GPU kernels. Check for:

- **D2H synchronization points**: search for device-to-host copies in the codebase — `cudaMemcpy(..., DeviceToHost)`, `cudaStreamSynchronize`, and framework-specific equivalents (PyTorch: `.item()`, `.cpu()`, `.numpy()`; Candle: `.to_vec1()`; CUDA C++: `cudaMemcpyAsync` followed by a sync). Each one stalls the GPU pipeline. Count them per inference step — eliminating unnecessary D2H syncs is often the single biggest win.
- **Excessive allocations**: check the `cudaMalloc` count in `nsys stats -r cuda_api_sum`. ML frameworks (Candle, PyTorch eager) often allocate fresh GPU buffers for every intermediate tensor. Hundreds of allocations per step is a sign of framework overhead.
- **CPU-side compute**: sampling, tokenization, or postprocessing running on the CPU while the GPU idles. Shows up as gaps between kernel launches in the nsys timeline.

## 6. Optimization Strategies

### Memory-bound

Goal: reduce bytes moved per operation.

- **Kernel fusion** (see latency-bound for prioritization): every intermediate tensor written to global memory and read back is wasted bandwidth
- **Vectorized loads**: `uint4` (128-bit) = 8 BF16 or 4 FP32 values per load. Requires pointer alignment. Verify with SASS (`LDG.E.128`). See the sketch after this list.
- **Shared memory reuse**: cache data accessed by multiple threads (use stride D+1 to avoid bank conflicts)
- **Reduce precision**: FP8/FP4 weights cut memory traffic 2-4x with minimal accuracy loss. Keep accumulation in FP32.
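To make the vectorized-loads bullet concrete, a hedged sketch: a BF16 scaling kernel (the kernel and its `scale` operation are hypothetical) that moves 8 values per 128-bit transaction. It assumes 16-byte-aligned pointers and an element count divisible by 8:

```cuda
#include <cuda_bf16.h>

// One uint4 = 16 bytes = 8 BF16 values per load/store. The compiler should
// emit LDG.E.128 / STG.E.128 for these accesses — verify in SASS (section 7).
__global__ void scale_bf16_vec(const uint4* __restrict__ in,
                               uint4* __restrict__ out,
                               float scale, int n_vec) {  // n_vec = n_elems / 8
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_vec) return;
    uint4 v = in[i];  // single 128-bit load
    __nv_bfloat16* h = reinterpret_cast<__nv_bfloat16*>(&v);
    #pragma unroll
    for (int k = 0; k < 8; ++k)  // compute in FP32, store back as BF16
        h[k] = __float2bfloat16(__bfloat162float(h[k]) * scale);
    out[i] = v;  // single 128-bit store
}
```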
### Compute-bound

Goal: do more useful math per cycle.

- **Fast math intrinsics**: `__expf()`, `__rsqrtf()`, `__fdividef()` — lower precision, but 2-4x faster than the full-precision equivalents
- **Warp shuffle**: `__shfl_down_sync(0xffffffff, val, offset)` for intra-warp reductions — avoids a shared memory round-trip
- **ILP**: interleave independent operations so the scheduler can hide latency. `#pragma unroll` for known-trip-count loops.
- **Tensor cores**: WMMA/MMA for matrix ops — 8-16x throughput over FP32 CUDA cores, but requires specific data layouts and alignment

### Latency-bound (launch overhead)

- **Kernel fusion** — the primary remedy (see the sketch after this list). Prioritize candidates by:
  1. `launches_per_step × layers` — fusing a 3-kernel sequence in a 32-layer model eliminates 96 launches
  2. Memory traffic eliminated — fusing `residual_add + rmsnorm` saves one full tensor round-trip per layer
  3. Implementation complexity — simple elementwise fusions (sigmoid×mul, silu×mul) are easy wins; attention-level fusions require careful shared memory management
- **Eliminate D2H sync points** — move computation to the GPU, batch device-to-host transfers to the end of the step (see section 5, "host-side bottlenecks", for how to find them)
- **Increase occupancy** — `__launch_bounds__(maxThreads, minBlocks)` when ncu shows low occupancy limiting throughput
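As an instance of the "easy win" class above, a sketch of a fused `silu×mul` (the gated-MLP pattern), assuming contiguous FP32 activations; the names are hypothetical:

```cuda
// Unfused, silu(gate) is a full intermediate tensor written to global memory
// and read back by the multiply. Fused, each element is read once, written
// once, and two launches per layer become one.
__global__ void silu_mul(const float* __restrict__ gate,
                         const float* __restrict__ up,
                         float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float g = gate[i];
        float s = g / (1.0f + __expf(-g));  // SiLU via fast-math __expf
        out[i] = s * up[i];
    }
}
```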
## 7. SASS Inspection (optional)

Use when profiling shows an unexplained gap between expected and actual performance, or to verify that the compiler is generating what you intended (vectorized loads, no spills, etc.).

```bash
cuobjdump -sass /tmp/kernel.cubin   # SASS disassembly
nvdisasm /tmp/kernel.cubin          # SASS with control flow
```

Quick checks (you don't need to read every instruction):

- **Grep for `STL`/`LDL`** — local memory access means register spills. If present, reduce register pressure.
- **Grep for `LDG.E.128`/`STG.E.128`** — confirms vectorized 128-bit loads/stores are being generated. If you wrote `uint4` loads but see `LDG.E.32`, alignment or type issues prevented vectorization.
- **Count `BAR.SYNC`** — compare to your `__syncthreads()` count. More barriers than expected means the compiler inserted extra synchronization.

## 8. Verify Correctness

**Always verify before benchmarking.** A faster kernel that produces wrong results is not an optimization.

```bash
# Project tests
cargo test / pytest / make test

# Memory errors (out-of-bounds, leaks, misaligned access)
compute-sanitizer --tool memcheck ./app [args]

# Shared memory data races
compute-sanitizer --tool racecheck ./app [args]

# Uninitialized device memory reads
compute-sanitizer --tool initcheck ./app [args]

# Sync hazards (divergent __syncthreads)
compute-sanitizer --tool synccheck ./app [args]
```

`compute-sanitizer` has significant overhead (10-100x). Use small inputs for correctness checks, full inputs for benchmarks.

## 9. Compare Results

Run the **same benchmark** as the baseline — same workload, same iteration count, same GPU state (no other processes competing for bandwidth).

```bash
./benchmark --iterations 5 --json-output /tmp/bench-after.json
```

Compare against the baseline numbers you recorded in step 1:

```text
Metric          Baseline     After        Delta
─────────────   ────────     ────────     ──────
Throughput      37.1 tok/s   39.8 tok/s   +7.3%
Stddev          ±0.4         ±0.5
Launches/token  622          478          -23%
GPU ms/token    26.2         24.1         -8.0%
```

Decision framework:

- **Delta < stddev of either run** → indistinguishable from noise. Run more iterations (10+) or use a longer workload before concluding.
- **Delta > stddev and > 3%** → real improvement; accept the change
- **Any regression in throughput** → revert, then analyze why (check whether a different kernel got slower — fusion can shift pressure)

Also check for side effects:

- Peak GPU memory (fusion can increase shared memory or register usage)
- Correctness (re-run tests if not done in step 8)
- Performance at different workloads (a prefill optimization might hurt decode, or vice versa)

## Common Pitfalls

- **Grid dimension limits**: y and z max out at 65535; only x supports 2^31-1. Use `blockIdx.x` for sequence/batch dimensions that can exceed 65535.
- **Shared memory bank conflicts**: 32 banks × 4 bytes. Power-of-2 column strides cause conflicts. Pad to stride D+1 (see the sketch below).
- **Divergent `__syncthreads()`**: undefined behavior. All threads in a block must reach the same barrier. Use `compute-sanitizer --tool synccheck` to detect it.
- **Vectorized load alignment**: `uint4` loads require 16-byte-aligned pointers. Verify alignment before using wide loads.
- **Profiling overhead**: nsys adds ~5-10% overhead; ncu adds 10-50x. Profile numbers are relative (compare before/after), not absolute production performance.
- **`__launch_bounds__` vs `--maxrregcount`**: `--maxrregcount` is per-file (all kernels); `__launch_bounds__` is per-kernel. Prefer per-kernel control.
- **Occupancy chasing**: >60% occupancy is usually sufficient. Maximizing occupancy by forcing lower register counts can cause spills that hurt more than the extra occupancy helps.
- **Unified memory systems** (Grace Hopper, DGX Spark): no PCIe transfer overhead, but bandwidth can be 3-7x lower than discrete GDDR7. This means batch=1 decode is almost always memory-bound — prioritize weight compression (FP8/FP4) and fusion over compute optimizations.
- **Hidden D2H syncs in hot loops**: see section 5, "host-side bottlenecks" — a single unnecessary GPU→CPU sync per layer × 32 layers = 32 pipeline stalls per step. Often the biggest win before touching any kernel code.
- **Framework allocator overhead**: eager-mode ML frameworks allocate/free GPU buffers for every intermediate tensor. At 600+ allocations per inference step, `cudaMalloc` overhead itself becomes measurable. Check `cuda_api_sum` for `cudaMalloc` call counts.
- **Fusing the wrong thing**: profile first. A kernel taking 0.1% of GPU time isn't worth fusing no matter how many launches it has. Focus on the top-3 time consumers and the top launch-count contributors.
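To make the bank-conflict pitfall concrete: the classic padded transpose tile, sketched under the assumption of a 32×32 thread block:

```cuda
#define TILE 32

// Without the +1 pad, the column read tile[threadIdx.x][threadIdx.y] hits the
// same bank for all 32 threads of a warp (row stride of 32 floats lands on
// the same bank every row). Padding each row to 33 floats staggers the banks.
__global__ void transpose(const float* __restrict__ in,
                          float* __restrict__ out, int width, int height) {
    __shared__ float tile[TILE][TILE + 1];  // the +1 kills bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;    // swap block offsets for output
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
```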