---
name: add-sgl-kernel
description: Step-by-step tutorial for adding a heavyweight AOT CUDA/C++ kernel to sgl-kernel (including tests & benchmarks)
---

# Tutorial: Adding a New Kernel to `sgl-kernel` (AOT / Heavyweight)

This tutorial walks through adding a simple element-wise scale operation as an AOT kernel. We'll implement `scale(x, factor) = x * factor` to demonstrate the complete workflow.

## Goal

Add a new operation that scales each element of a tensor by a scalar factor:

- Input: tensor `x` (CUDA) and scalar `factor` (float)
- Output: `x * factor` (element-wise, in place or into a pre-allocated `out`)
- Supported dtypes: **FP16 (`torch.float16`), BF16 (`torch.bfloat16`), FP32 (`torch.float32`)**
- Dispatched via the `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` macro (defined in `sgl-kernel/include/utils.h`)

## Two rules of thumb (must follow)

1. **Prefer `python/sglang/jit_kernel` first** when the kernel does **not** depend on CUTLASS or another large C++ project. This is the default path for lightweight kernels that benefit from rapid iteration.
2. **Prefer `sgl-kernel`** when the kernel **does** depend on CUTLASS or another large C++ project, or when it should be part of the AOT wheel / torch op registration flow.
3. **Exception**: if the dependency is `flashinfer`, or CUTLASS that is already provided through `flashinfer`, the kernel can still be implemented as a `jit_kernel`.

In addition, every new kernel must ship with:

- **Tests** (pytest)
- **A benchmark script** (triton.testing)

---

## Repository integration map

You will typically touch these files/areas:

- Implementation: `sgl-kernel/csrc/elementwise/scale.cu` (pick the right subdirectory)
- Public declarations: `sgl-kernel/include/sgl_kernel_ops.h`
- Torch extension registration: `sgl-kernel/csrc/common_extension.cc`
- Build: `sgl-kernel/CMakeLists.txt` (`set(SOURCES ...)`)
- Python API: `sgl-kernel/python/sgl_kernel/` and `sgl-kernel/python/sgl_kernel/__init__.py`
- Tests: `sgl-kernel/tests/test_scale.py`
- Benchmarks: `sgl-kernel/benchmark/bench_scale.py`

---

## Step 1: Implement the kernel in `csrc/`

Pick the right subdirectory:

- `csrc/elementwise/` — for element-wise ops (our example)
- `csrc/gemm/`, `csrc/attention/`, `csrc/moe/` — for other categories

Create `sgl-kernel/csrc/elementwise/scale.cu`:

```cpp
#include <ATen/ATen.h>
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>

#include "utils.h"  // DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16

// scale_kernel: out[i] = input[i] * factor
// Supports float, half (__half), __nv_bfloat16 via template T
template <typename T>
__global__ void scale_kernel(T* __restrict__ out, const T* __restrict__ input, float factor, int64_t n) {
  int64_t idx = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (idx < n) {
    // Compute in float for accuracy, then cast back to the storage type.
    out[idx] = static_cast<T>(static_cast<float>(input[idx]) * factor);
  }
}

void scale(at::Tensor& out, const at::Tensor& input, double factor) {
  TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
  TORCH_CHECK(input.is_contiguous(), "input must be contiguous");
  TORCH_CHECK(out.is_cuda(), "out must be a CUDA tensor");
  TORCH_CHECK(out.is_contiguous(), "out must be contiguous");
  TORCH_CHECK(out.sizes() == input.sizes(), "out and input must have the same shape");
  TORCH_CHECK(out.scalar_type() == input.scalar_type(), "out and input must have the same dtype");

  const int64_t n = input.numel();
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;

  // Set the device guard before querying the stream, so the stream
  // belongs to the device that owns `input`.
  const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();

  // Dispatches over float, float16, bfloat16
  DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16(input.scalar_type(), c_type, [&] {
    scale_kernel<c_type><<<blocks, threads, 0, stream>>>(
        static_cast<c_type*>(out.data_ptr()),
        static_cast<const c_type*>(input.data_ptr()),
        static_cast<float>(factor),
        n);
    cudaError_t status = cudaGetLastError();
    TORCH_CHECK(status == cudaSuccess, "scale_kernel launch failed: ", cudaGetErrorString(status));
    return true;
  });
}
```

**Key points:**

- Use `at::Tensor` (PyTorch tensors), `TORCH_CHECK` for validation, and `at::cuda::getCurrentCUDAStream()` for the stream
- `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` covers `float`, `half` (FP16), and `__nv_bfloat16` (BF16)
- Check `cudaGetLastError()` after every kernel launch
- If a kernel only works on certain architectures, enforce that with `TORCH_CHECK` in the launcher and skip logic in the tests (see the sketch below)
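For illustration only — our `scale` kernel has no architecture constraint — the test-side skip logic could look like this minimal sketch, assuming a hypothetical kernel that requires compute capability 8.0+:

```python
import pytest
import torch

# Hypothetical guard: skip when the GPU is older than SM80.
# The scale kernel in this tutorial does not need this; it only
# illustrates the "skip logic in tests" pattern.
requires_sm80 = pytest.mark.skipif(
    not torch.cuda.is_available() or torch.cuda.get_device_capability()[0] < 8,
    reason="requires compute capability >= 8.0",
)


@requires_sm80
def test_arch_gated_kernel():
    ...  # body of an arch-constrained test
```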
---

## Step 2: Add a C++ declaration in `include/sgl_kernel_ops.h`

Edit `sgl-kernel/include/sgl_kernel_ops.h`, add to the elementwise section:

```cpp
void scale(at::Tensor& out, const at::Tensor& input, double factor);
```

---

## Step 3: Register the op in `csrc/common_extension.cc`

Edit `sgl-kernel/csrc/common_extension.cc`, inside `TORCH_LIBRARY_FRAGMENT(sgl_kernel, m)`:

```cpp
// From csrc/elementwise
m.def("scale(Tensor! out, Tensor input, float factor) -> ()");
m.impl("scale", torch::kCUDA, &scale);
```

**Key points:**

- `Tensor!` marks an in-place / mutable output argument
- The schema matters for `torch.compile` and for consistent call signatures
- In the schema language, `float` corresponds to C++ `double`; the implicit cast is fine for scalars. Use shims if other types need conversion

---

## Step 4: Add the new source file to `CMakeLists.txt`

Edit `sgl-kernel/CMakeLists.txt`, add to `set(SOURCES ...)`:

```cmake
csrc/elementwise/scale.cu
```

**Key points:**

- Keep the list **alphabetically sorted** (the file explicitly requires this)
- If the kernel has arch constraints, reflect that in tests/benchmarks via skip logic

---

## Step 5: Expose a Python API under `sgl-kernel/python/sgl_kernel/`

Prefer following the existing module organization first. For elementwise kernels, the usual pattern is:

- implement the Python wrapper in `sgl-kernel/python/sgl_kernel/elementwise.py`
- then re-export it from `sgl-kernel/python/sgl_kernel/__init__.py`

For example, in `sgl-kernel/python/sgl_kernel/elementwise.py`, add:

```python
import torch


def scale(
    input: torch.Tensor,
    factor: float,
    out: torch.Tensor | None = None,
) -> torch.Tensor:
    """
    Element-wise scale: out = input * factor.

    Supported dtypes: torch.float16, torch.bfloat16, torch.float32.

    Parameters
    ----------
    input : CUDA input tensor
    factor : scale factor (float)
    out : optional pre-allocated CUDA output tensor (same shape/dtype as input)
    """
    if out is None:
        out = torch.empty_like(input)
    torch.ops.sgl_kernel.scale.default(out, input, factor)
    return out
```

Then re-export it from `sgl-kernel/python/sgl_kernel/__init__.py` following the existing import style used by other kernels.
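As a minimal sketch (the exact form should match whatever `__init__.py` already does for neighboring kernels), the re-export amounts to one line:

```python
# sgl-kernel/python/sgl_kernel/__init__.py (excerpt; import style is illustrative)
from sgl_kernel.elementwise import scale
```

After this, users can call `sgl_kernel.scale(...)` directly instead of going through `torch.ops.sgl_kernel`.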
---

## Step 6: Write tests (required)

Create `sgl-kernel/tests/test_scale.py`:

```python
import pytest
import torch

import sgl_kernel


@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16, torch.float32])
@pytest.mark.parametrize("size", [128, 1024, 4096, 65536])
@pytest.mark.parametrize("factor", [0.5, 1.0, 2.0])
def test_scale_correctness(dtype, size, factor):
    input = torch.randn(size, dtype=dtype, device="cuda")
    out = torch.empty_like(input)
    result = sgl_kernel.scale(input, factor, out=out)
    assert result is out
    expected = input * factor
    rtol, atol = (1e-5, 1e-6) if dtype == torch.float32 else (1e-2, 1e-2)
    torch.testing.assert_close(out, expected, rtol=rtol, atol=atol)


def test_scale_shape_mismatch():
    input = torch.randn(128, dtype=torch.float16, device="cuda")
    out = torch.empty(256, dtype=torch.float16, device="cuda")
    with pytest.raises(RuntimeError, match="same shape"):
        sgl_kernel.scale(input, 2.0, out=out)


def test_scale_cpu_input():
    input = torch.randn(128, dtype=torch.float16)  # CPU tensor
    out = torch.empty_like(input)
    with pytest.raises(RuntimeError, match="CUDA"):
        sgl_kernel.scale(input, 2.0, out=out)


if __name__ == "__main__":
    pytest.main([__file__, "-q"])
```

---

## Step 7: Add a benchmark (required)

Create `sgl-kernel/benchmark/bench_scale.py`:

```python
import itertools
import os

import torch
import triton
import triton.testing

import sgl_kernel

IS_CI = (
    os.getenv("CI", "false").lower() == "true"
    or os.getenv("GITHUB_ACTIONS", "false").lower() == "true"
)

dtypes = [torch.float16] if IS_CI else [torch.float16, torch.bfloat16, torch.float32]
sizes = [4096] if IS_CI else [2**n for n in range(10, 20)]  # 1K … 512K
configs = list(itertools.product(dtypes, sizes))


def torch_scale(input: torch.Tensor, factor: float) -> torch.Tensor:
    return input * factor


@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["dtype", "size"],
        x_vals=configs,
        line_arg="provider",
        line_vals=["sglang", "torch"],
        line_names=["SGL Kernel", "PyTorch"],
        styles=[("green", "-"), ("red", "--")],
        ylabel="µs (median)",
        plot_name="scale-performance",
        args={},
    )
)
def benchmark(dtype, size, provider):
    input = torch.randn(size, dtype=dtype, device="cuda")
    out = torch.empty_like(input)
    factor = 2.0
    if provider == "sglang":
        fn = lambda: sgl_kernel.scale(input, factor, out=out)
    else:
        fn = lambda: torch_scale(input, factor)
    ms, min_ms, max_ms = triton.testing.do_bench_cudagraph(fn, quantiles=[0.5, 0.2, 0.8])
    return 1000 * ms, 1000 * max_ms, 1000 * min_ms


if __name__ == "__main__":
    benchmark.run(print_data=True)
```

---

## Step 8: Build

```bash
cd sgl-kernel
make build -j16
```

If you need to limit host resource usage:

```bash
cd sgl-kernel
make build -j1 MAX_JOBS=2 CMAKE_ARGS="-DSGL_KERNEL_COMPILE_THREADS=1"
```

---

## Step 9: Validate

After building successfully, run the test and benchmark:

```bash
pytest sgl-kernel/tests/test_scale.py -q
python sgl-kernel/benchmark/bench_scale.py
```

---

## Troubleshooting

- **Async CUDA errors**: rerun with `CUDA_LAUNCH_BLOCKING=1` so the error surfaces at the failing launch (or force a sync from Python, as sketched below)
- **Memory errors**: `compute-sanitizer --tool memcheck python ...`
- **Build is too slow / OOM**: reduce `MAX_JOBS` and `SGL_KERNEL_COMPILE_THREADS`
- **Binary bloat**: use `sgl-kernel/analyze_whl_kernel_sizes.py`
- **Undefined symbol at link time**: check that your `.cu` file is listed in `set(SOURCES ...)` in `CMakeLists.txt`
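When an async error is hard to localize, forcing a synchronization right after the suspect op from Python is often quicker than a full `CUDA_LAUNCH_BLOCKING=1` run. A minimal sketch:

```python
import torch

import sgl_kernel

x = torch.randn(1024, dtype=torch.float16, device="cuda")
out = sgl_kernel.scale(x, 2.0)
torch.cuda.synchronize()  # any pending launch error from scale_kernel surfaces here
```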
---

## References

- `sgl-kernel/README.md`
- `sgl-kernel/include/sgl_kernel_ops.h`
- `sgl-kernel/csrc/common_extension.cc`
- `sgl-kernel/CMakeLists.txt`
- `sgl-kernel/include/utils.h` — `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` macro and friends
- `sgl-kernel/csrc/elementwise/activation.cu` — reference for the FP16/BF16/FP32 dispatch pattern

## Summary of Files Created/Modified

```
sgl-kernel/csrc/elementwise/scale.cu          # NEW: CUDA kernel + launcher
sgl-kernel/include/sgl_kernel_ops.h           # MODIFIED: C++ declaration
sgl-kernel/csrc/common_extension.cc           # MODIFIED: schema + dispatch registration
sgl-kernel/CMakeLists.txt                     # MODIFIED: add source file (alphabetical)
sgl-kernel/python/sgl_kernel/elementwise.py   # MODIFIED: Python wrapper
sgl-kernel/python/sgl_kernel/__init__.py      # MODIFIED: re-export Python API
sgl-kernel/tests/test_scale.py                # NEW: tests
sgl-kernel/benchmark/bench_scale.py           # NEW: benchmark
```
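Once all of the above is in place and the wheel is built, a quick end-to-end smoke check from a Python session can confirm the registration and the numerics together (a minimal sketch, assuming the schema from Step 3):

```python
import torch

import sgl_kernel  # loads the extension and registers torch.ops.sgl_kernel.scale

x = torch.randn(1024, dtype=torch.float16, device="cuda")
print(torch.ops.sgl_kernel.scale.default)  # verify the op overload is registered
y = sgl_kernel.scale(x, 2.0)
torch.testing.assert_close(y, x * 2.0, rtol=1e-2, atol=1e-2)
```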