--- name: cuda-guide description: | CUDA/GPU computing guardrails, patterns, and best practices for AI-assisted development. Use when working with CUDA files (.cu, .cuh), or when the user mentions CUDA/GPU programming. Provides kernel design patterns, memory hierarchy guidelines, and occupancy optimization specific to this project's coding standards. license: MIT metadata: author: samuel version: "1.0" category: language language: cuda extensions: ".cu,.cuh" --- # CUDA Guide > Applies to: CUDA 11+, GPU Computing, Deep Learning, Scientific Computing, HPC ## Core Principles 1. **Parallelism First**: Design algorithms for thousands of concurrent threads; serial thinking is the primary enemy of GPU performance 2. **Memory Hierarchy Awareness**: Global memory is 100x slower than shared memory and 1000x slower than registers; every kernel design starts with memory access planning 3. **Coalesced Access**: Adjacent threads must access adjacent memory addresses; a single misaligned access pattern can reduce bandwidth by 32x 4. **Occupancy Over Cleverness**: Maximize active warps per SM by managing register count, shared memory usage, and block dimensions together 5. **Minimize Host-Device Transfers**: PCIe bandwidth is the bottleneck; overlap transfers with computation using streams and pinned memory ## Guardrails ### Error Checking - ALWAYS check CUDA API return values with a macro wrapper - ALWAYS call `cudaGetLastError()` after every kernel launch - ALWAYS call `cudaDeviceSynchronize()` before reading kernel results on the host - Use `compute-sanitizer` (successor to `cuda-memcheck`) in development builds - Handle `cudaErrorMemoryAllocation` gracefully; never assume GPU memory is infinite ```cuda #define CUDA_CHECK(call) \ do { \ cudaError_t err = call; \ if (err != cudaSuccess) { \ fprintf(stderr, "CUDA error at %s:%d: %s\n", \ __FILE__, __LINE__, cudaGetErrorString(err)); \ exit(EXIT_FAILURE); \ } \ } while (0) #define CUDA_CHECK_KERNEL() \ do { \ cudaError_t err = cudaGetLastError(); \ if (err != cudaSuccess) { \ fprintf(stderr, "Kernel launch error at %s:%d: %s\n", \ __FILE__, __LINE__, cudaGetErrorString(err)); \ exit(EXIT_FAILURE); \ } \ } while (0) ``` ### Memory Management - Pair every `cudaMalloc` with a `cudaFree`; prefer RAII wrappers in C++ host code - Use `cudaMallocManaged` (Unified Memory) for prototyping; switch to explicit transfers for production - Use `cudaMallocHost` (pinned memory) when streaming data to the GPU; pageable memory cannot overlap with compute - Prefer `cudaMemcpyAsync` with streams over synchronous `cudaMemcpy` - Never access device pointers from host code or host pointers from device code (except Unified Memory) - Call `cudaMemset` or `cudaMemsetAsync` to zero-initialize device buffers ### Kernel Design - Block size must be a multiple of warp size (32); prefer 128, 256, or 512 - Calculate grid size as `(n + block_size - 1) / block_size` - Always include bounds checking: `if (idx < n)` at the top of every kernel - Use grid-stride loops for kernels that must handle arbitrary data sizes - Document thread mapping: which dimension maps to which data axis - Mark device-only helpers as `__device__`, host+device as `__host__ __device__` ```cuda // Grid-stride loop: works with any grid size, any data size __global__ void saxpy(float a, const float* x, float* y, int n) { for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) { y[i] = a * x[i] + y[i]; } } ``` ### Synchronization - Use `__syncthreads()` after every shared memory write before any thread reads another thread's value - Never place `__syncthreads()` inside a conditional branch that not all threads in a block will reach (deadlock) - Use `__syncwarp()` (CUDA 9+) for warp-level synchronization instead of relying on implicit warp-synchronous execution - Use `cudaDeviceSynchronize()` sparingly in production; prefer stream synchronization with `cudaStreamSynchronize()` - Use CUDA events (`cudaEventRecord` / `cudaEventSynchronize`) for fine-grained inter-stream ordering ### Performance - Profile before optimizing: use Nsight Compute for kernel analysis, Nsight Systems for system-level view - Target >50% theoretical occupancy; use the CUDA Occupancy Calculator to tune block dimensions - Aim for >60% of peak memory bandwidth in memory-bound kernels - Avoid warp divergence: ensure threads within a warp take the same branch when possible - Prefer `float` over `double` on consumer GPUs (2x throughput difference) - Minimize atomic operations on global memory; use shared memory atomics with a final reduction ## Memory Hierarchy Understanding the memory hierarchy is the single most important factor in CUDA performance. | Memory Type | Scope | Latency (cycles) | Size | Cached | Read/Write | |-------------|-------|-------------------|------|--------|------------| | Registers | Thread | 1 | ~255 per thread | N/A | R/W | | Shared | Block | ~5 | 48-164 KB per SM | N/A | R/W | | L1 Cache | SM | ~28 | 48-192 KB per SM | Auto | R | | L2 Cache | Device | ~200 | 4-40 MB | Auto | R/W | | Global | Device | ~400-600 | 4-80 GB (HBM/GDDR) | Yes | R/W | | Constant | Device | ~5 (cached) | 64 KB | Yes (broadcast) | R | | Texture | Device | ~400 (cached) | Global pool | Yes (spatial) | R | **Decision guide:** - Data reused within a thread -> registers (automatic via local variables) - Data shared across threads in a block -> `__shared__` memory - Read-only data broadcast to all threads -> `__constant__` memory - Large read-only data with spatial locality -> texture memory - Everything else -> global memory with coalesced access patterns ## Key Patterns ### Kernel Launch Configuration ```cuda // Query device for optimal configuration void launch_optimized(const float* input, float* output, int n) { int block_size; int min_grid_size; // Let the runtime suggest optimal block size for maximum occupancy cudaOccupancyMaxPotentialBlockSize( &min_grid_size, &block_size, my_kernel, 0, n); int grid_size = (n + block_size - 1) / block_size; my_kernel<<>>(input, output, n); CUDA_CHECK_KERNEL(); } ``` ### Coalesced Memory Access ```cuda // BAD: Strided access -- adjacent threads access non-adjacent memory // Each warp issues 32 separate memory transactions __global__ void transpose_naive(const float* in, float* out, int W, int H) { int x = blockIdx.x * blockDim.x + threadIdx.x; int y = blockIdx.y * blockDim.y + threadIdx.y; if (x < W && y < H) { out[x * H + y] = in[y * W + x]; // Write is strided } } // GOOD: Use shared memory to coalesce both reads and writes __global__ void transpose_coalesced( const float* in, float* out, int W, int H ) { __shared__ float tile[32][33]; // +1 padding avoids bank conflicts int x = blockIdx.x * 32 + threadIdx.x; int y = blockIdx.y * 32 + threadIdx.y; if (x < W && y < H) { tile[threadIdx.y][threadIdx.x] = in[y * W + x]; // Coalesced read } __syncthreads(); x = blockIdx.y * 32 + threadIdx.x; y = blockIdx.x * 32 + threadIdx.y; if (x < H && y < W) { out[y * H + x] = tile[threadIdx.x][threadIdx.y]; // Coalesced write } } ``` ### Shared Memory Tiling ```cuda // Dot product of two vectors using shared memory reduction __global__ void dot_product( const float* a, const float* b, float* result, int n ) { __shared__ float cache[256]; int tid = threadIdx.x; int idx = blockIdx.x * blockDim.x + threadIdx.x; // Each thread computes its partial sum via grid-stride float partial = 0.0f; for (int i = idx; i < n; i += blockDim.x * gridDim.x) { partial += a[i] * b[i]; } cache[tid] = partial; __syncthreads(); // Tree reduction in shared memory for (int s = blockDim.x / 2; s > 0; s >>= 1) { if (tid < s) { cache[tid] += cache[tid + s]; } __syncthreads(); } if (tid == 0) { atomicAdd(result, cache[0]); } } ``` ### Warp-Level Primitives (CUDA 9+) ```cuda // Warp-level reduction using shuffle instructions -- no shared memory needed __device__ float warp_reduce_sum(float val) { for (int offset = warpSize / 2; offset > 0; offset /= 2) { val += __shfl_down_sync(0xFFFFFFFF, val, offset); } return val; } // Block-level reduction combining warp shuffles and shared memory __device__ float block_reduce_sum(float val) { __shared__ float warp_sums[32]; // One slot per warp (max 32 warps/block) int lane = threadIdx.x % warpSize; int warp_id = threadIdx.x / warpSize; val = warp_reduce_sum(val); if (lane == 0) { warp_sums[warp_id] = val; } __syncthreads(); // First warp reduces the warp sums int num_warps = (blockDim.x + warpSize - 1) / warpSize; val = (threadIdx.x < num_warps) ? warp_sums[threadIdx.x] : 0.0f; if (warp_id == 0) { val = warp_reduce_sum(val); } return val; } ``` ## Performance ### Occupancy Calculator ```cuda // Query occupancy at compile time for tuning void report_occupancy() { int block_size = 256; int num_blocks; cudaOccupancyMaxActiveBlocksPerMultiprocessor( &num_blocks, my_kernel, block_size, 0); cudaDeviceProp prop; cudaGetDeviceProperties(&prop, 0); int active_warps = num_blocks * (block_size / prop.warpSize); int max_warps = prop.maxThreadsPerMultiProcessor / prop.warpSize; float occupancy = (float)active_warps / max_warps; printf("Occupancy: %.1f%% (%d/%d warps)\n", occupancy * 100, active_warps, max_warps); } ``` ### Nsight Profiling Workflow ```bash # System-level trace: find CPU/GPU idle gaps, stream concurrency nsys profile -o trace ./program nsys stats trace.nsys-rep # Kernel-level analysis: roofline, memory throughput, occupancy ncu --set full -o kernel_report ./program ncu -i kernel_report.ncu-rep # Open in Nsight Compute GUI # Quick single-metric check ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed ./program ``` ### Memory Bandwidth Measurement ```cuda // Measure effective bandwidth of a kernel void measure_bandwidth(int n) { size_t bytes = 2 * n * sizeof(float); // Read A + Write B cudaEvent_t start, stop; CUDA_CHECK(cudaEventCreate(&start)); CUDA_CHECK(cudaEventCreate(&stop)); CUDA_CHECK(cudaEventRecord(start)); copy_kernel<<>>(d_in, d_out, n); CUDA_CHECK(cudaEventRecord(stop)); CUDA_CHECK(cudaEventSynchronize(stop)); float ms = 0; CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop)); float gb_per_sec = bytes / (ms * 1e6); printf("Effective bandwidth: %.2f GB/s\n", gb_per_sec); CUDA_CHECK(cudaEventDestroy(start)); CUDA_CHECK(cudaEventDestroy(stop)); } ``` ## Tooling ### Essential Commands ```bash # Compile CUDA code nvcc -arch=sm_80 -O3 -o program main.cu # Single file nvcc -arch=native -lineinfo -o program main.cu # With debug line info # CMake build cmake -B build -DCMAKE_CUDA_ARCHITECTURES="70;80;86" cmake --build build -j$(nproc) # Runtime debugging compute-sanitizer ./program # Memory errors (replaces cuda-memcheck) compute-sanitizer --tool racecheck ./program # Shared memory race conditions compute-sanitizer --tool initcheck ./program # Uninitialized device memory reads compute-sanitizer --tool synccheck ./program # Synchronization errors # Profiling nsys profile ./program # System-level timeline ncu ./program # Kernel-level metrics ncu --kernel-name my_kernel --launch-skip 2 --launch-count 1 ./program # Device info nvidia-smi # GPU status and memory usage nvcc --version # CUDA compiler version ``` ### CMakeLists.txt Template ```cmake cmake_minimum_required(VERSION 3.18) project(myproject LANGUAGES CXX CUDA) set(CMAKE_CXX_STANDARD 17) set(CMAKE_CUDA_STANDARD 17) set(CMAKE_CUDA_ARCHITECTURES 70 80 86) set(CMAKE_CUDA_SEPARABLE_COMPILATION ON) find_package(CUDAToolkit REQUIRED) add_library(kernels src/kernels.cu) target_include_directories(kernels PUBLIC include) add_executable(main src/main.cpp) target_link_libraries(main kernels CUDA::cudart) enable_testing() add_executable(tests tests/test_kernels.cu) target_link_libraries(tests kernels CUDA::cudart) add_test(NAME gpu_tests COMMAND tests) ``` ## References For detailed patterns and examples, see: - [references/patterns.md](references/patterns.md) -- Tiled matrix multiply, parallel reduction tree, stream overlap pipeline ## External References - [CUDA C++ Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/) - [CUDA C++ Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/) - [Nsight Compute Documentation](https://docs.nvidia.com/nsight-compute/) - [Nsight Systems Documentation](https://docs.nvidia.com/nsight-systems/) - [Thrust Documentation](https://nvidia.github.io/cccl/thrust/) - [CUDA Samples](https://github.com/NVIDIA/cuda-samples) - [NVIDIA Occupancy Calculator](https://docs.nvidia.com/cuda/cuda-occupancy-calculator/) - [GPU Memory Hierarchy (GTC talk)](https://developer.nvidia.com/blog/tag/cuda/)