---
name: tilelang-developer
description: "Write, optimize, and debug high-performance AI compute kernels using TileLang (a Python DSL for GPU programming). Use when the user requests: (1) Writing custom GPU kernels for AI workloads (GEMM, Attention, MLA, etc.), (2) Optimizing existing TileLang code for NVIDIA, AMD, or Ascend hardware, (3) Implementing non-standard operators (like DeepSeek MLA, FlashAttention variants), (4) Debugging TileLang compilation or runtime errors, or (5) Cross-platform kernel development targeting multiple GPU vendors."
---

# TileLang Developer

Write high-performance AI compute kernels using TileLang - a tile-based programming model that bridges the gap between CUDA's low-level control and high-level abstractions.

## When to Use This Skill

Use this skill when the user needs to:

- Implement custom GPU kernels for AI operations (matrix multiplication, attention mechanisms, etc.)
- Optimize performance-critical operators for modern GPUs (NVIDIA Ampere/Hopper, AMD MI300X, Ascend NPU)
- Debug TileLang code or resolve performance issues
- Port kernels across different hardware platforms
- Understand or modify existing TileLang implementations

## Kernel Development Workflow

Follow these steps when writing a TileLang kernel:

### Step 1: Analyze Requirements

Gather essential information:

**Input/Output Specifications:**
- Tensor shapes (M, N, K dimensions)
- Data types (float16, float32, bfloat16, int8)
- Memory layout (row-major, column-major)

**Hardware Target:**
- NVIDIA GPU (Ampere A100, Hopper H100, etc.)
- AMD GPU (MI300X, etc.)
- Huawei Ascend NPU

**Performance Goals:**
- Target throughput or latency
- Memory bandwidth constraints
- Comparison baseline (cuBLAS, vendor libraries)

**Ask clarifying questions if details are missing.**

### Step 2: Set Up Kernel Structure

Create the basic kernel scaffold:

```python
import tilelang
import tilelang.language as T

@tilelang.jit(target="cuda", out_idx=[2])  # Specify output indices
def kernel_name(M, N, K, block_M, block_N, block_K):

    @T.prim_func
    def main(
        A: T.Buffer((M, K), "float16"),
        B: T.Buffer((K, N), "float16"),
        C: T.Buffer((M, N), "float16")
    ):
        # Kernel logic will go here
        pass

    return main
```

**Key decisions:**
- `target`: "cuda" (NVIDIA), "hip" (AMD), or "cpu"
- `out_idx`: List indices of output parameters
- Block dimensions: Typical values are 64, 128, or 256

### Step 3: Define Grid and Memory Hierarchy

Set up the computation grid and allocate memory:

```python
# Define grid dimensions
with T.Kernel(
    T.ceildiv(N, block_N),  # Grid X
    T.ceildiv(M, block_M),  # Grid Y
    threads=128
) as (bx, by):
    # Allocate shared memory (L1 cache)
    A_shared = T.alloc_shared((block_M, block_K), "float16")
    B_shared = T.alloc_shared((block_K, block_N), "float16")

    # Allocate register fragments (accumulators)
    C_local = T.alloc_fragment((block_M, block_N), "float32")

    # CRITICAL: Apply swizzle layout to avoid bank conflicts
    T.annotate_layout({
        A_shared: T.make_swizzled_layout(A_shared),
        B_shared: T.make_swizzled_layout(B_shared)
    })
```

**Memory hierarchy:**
- **Global Memory** (HBM): Input/output tensors, slowest
- **Shared Memory** (L1): Explicitly managed cache, ~164KB on A100
- **Registers**: Fastest, used for accumulators and temporaries

**Critical optimization:** Always apply `T.make_swizzled_layout` to shared memory to eliminate bank conflicts.
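To make the shared-memory budget concrete, here is a small illustrative calculation in plain Python (the `smem_bytes` helper is ours, not a TileLang API). The pipelining introduced in Step 4 multiplies the per-stage footprint by `num_stages`:

```python
def smem_bytes(block_M, block_N, block_K, num_stages=3, dtype_bytes=2):
    """Rough shared-memory footprint of the A and B tiles (illustrative only)."""
    a_tile = block_M * block_K * dtype_bytes   # A_shared, per pipeline stage
    b_tile = block_K * block_N * dtype_bytes   # B_shared, per pipeline stage
    return (a_tile + b_tile) * num_stages

# 128x128x32 FP16 tiles with 3-stage pipelining: (8 KB + 8 KB) * 3 = 48 KB,
# comfortably under the ~164 KB per SM on A100; larger tiles or more stages
# can exceed MI300X's 64 KB limit.
print(smem_bytes(128, 128, 32) / 1024, "KB")
```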
### Step 4: Implement Computation Logic

Use TileLang primitives for data movement and computation:

```python
# Initialize accumulator
T.clear(C_local)

# Main computation loop with software pipelining
for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
    # Load tiles from global to shared memory
    T.copy(A[by * block_M, k * block_K], A_shared)
    T.copy(B[k * block_K, bx * block_N], B_shared)

    # Compute using Tensor Cores
    T.gemm(A_shared, B_shared, C_local, transpose_B=False)

# Write results back
T.copy(C_local, C[by * block_M, bx * block_N])
```

**Key primitives:**
- `T.copy`: Intelligent data transfer (auto-selects cp.async, TMA, etc.)
- `T.gemm`: Matrix multiplication using Tensor Cores
- `T.Pipelined`: Software pipelining to overlap compute and memory transfer
- `T.Parallel`: Element-wise parallel operations

**Pipeline stages:**
- `num_stages=2`: Double buffering
- `num_stages=3`: Triple buffering (recommended for most workloads)
- `num_stages=4+`: Diminishing returns, increases shared memory usage

### Step 5: Validate and Test

Generate test code to verify correctness:

```python
# Example instantiation
func = kernel_name(
    M=1024, N=1024, K=1024,
    block_M=128, block_N=128, block_K=32
)

# Test against reference implementation
import torch

A = torch.randn(1024, 1024, dtype=torch.float16, device='cuda')
B = torch.randn(1024, 1024, dtype=torch.float16, device='cuda')
C_reference = A @ B

# out_idx=[2] in the decorator means the kernel allocates and returns C
C_tilelang = func(A, B)

# Verify with appropriate tolerance for FP16
torch.testing.assert_close(C_tilelang, C_reference, rtol=1e-3, atol=1e-3)
```

### Step 6: Optimize Performance

Apply advanced optimizations if performance is suboptimal:

**Block Size Tuning** (see the sweep sketch after this step):
- A100: Try 128×128×32 or 64×64×32
- H100: Can use larger blocks (256×128×32)
- MI300X: May need smaller blocks due to 64KB shared memory limit

**Pipeline Depth:**
- Increase `num_stages` if memory-bound
- Decrease if shared memory is exhausted

**Warp Policy (for advanced cases):**

```python
T.gemm(A, B, C, policy=T.GemmWarpPolicy.FullRow)  # For attention
T.gemm(A, B, C, policy=T.GemmWarpPolicy.FullCol)  # For MLA decode
```

**Block-level swizzle:**

```python
T.use_swizzle(panel_size=10)  # Improves L2 cache hit rate
```
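Block-size tuning can be explored empirically. The sketch below is illustrative only (the `benchmark` helper and candidate list are ours, not a TileLang API); it assumes the `kernel_name` factory from Step 2, where `out_idx=[2]` makes the compiled kernel return its output:

```python
import torch

def benchmark(func, A, B, warmup=10, iters=50):
    """Average kernel time in milliseconds using CUDA events (illustrative helper)."""
    for _ in range(warmup):
        func(A, B)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        func(A, B)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

M = N = K = 4096
A = torch.randn(M, K, dtype=torch.float16, device="cuda")
B = torch.randn(K, N, dtype=torch.float16, device="cuda")

# Candidate tile shapes from the guidance above; trim for GPUs with less shared memory.
for block_M, block_N, block_K in [(64, 64, 32), (128, 128, 32), (256, 128, 32)]:
    func = kernel_name(M, N, K, block_M, block_N, block_K)
    ms = benchmark(func, A, B)
    tflops = 2 * M * N * K / (ms * 1e-3) / 1e12
    print(f"{block_M}x{block_N}x{block_K}: {ms:.3f} ms ({tflops:.1f} TFLOPS)")
```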
## Common Kernel Patterns

### Matrix Multiplication (GEMM)

Most fundamental kernel. See [EXAMPLES.md](references/EXAMPLES.md#matrix-multiplication-gemm) for complete implementation.

**Key features:**
- 3-stage pipelining
- Swizzle layout for shared memory
- Float32 accumulator for precision

### FlashAttention

Memory-efficient attention with online softmax. See [EXAMPLES.md](references/EXAMPLES.md#flashattention-v2) for complete implementation.

**Key features:**
- O(N) memory complexity
- Online softmax statistics
- Fused kernel (no intermediate materialization)

### DeepSeek MLA

Multi-Head Latent Attention with KV compression. See [EXAMPLES.md](references/EXAMPLES.md#deepseek-mla-decoding) for complete implementation.

**Key features:**
- Split-KV parallelization
- Non-standard dimensions
- FullCol warp policy for narrow matrices

## Reference Documentation

When you need specific information:

- **API details** (parameters, signatures, options): Read [API_REFERENCE.md](references/API_REFERENCE.md)
- **Complete code examples** (GEMM, Attention, MLA): Read [EXAMPLES.md](references/EXAMPLES.md)
- **Troubleshooting** (errors, performance issues): Read [DEBUGGING.md](references/DEBUGGING.md)

## Critical Performance Guidelines

Always include these optimizations:

1. **Swizzle layout for shared memory:**
   ```python
   T.annotate_layout({
       A_shared: T.make_swizzled_layout(A_shared)
   })
   ```
2. **Software pipelining:**
   ```python
   for k in T.Pipelined(num_blocks, num_stages=3):
   ```
3. **Float32 accumulators:**
   ```python
   C_local = T.alloc_fragment((M, N), "float32")  # Not float16
   ```
4. **Aligned block_K:**
   ```python
   block_K = 32  # Or 16; must be a multiple of 16 to match Tensor Core tile shapes
   ```
5. **Initialize accumulators:**
   ```python
   T.clear(C_local)
   ```

## Hardware-Specific Considerations

### NVIDIA GPUs
- **Ampere (A100)**: Use cp.async, num_stages=3, block_K=32
- **Hopper (H100)**: Can use TMA, larger blocks (256×128), num_stages=4
- Shared memory: 164KB (A100), 228KB (H100)

### AMD GPUs
- **MI300X**: Use target="hip", smaller blocks, 64KB shared memory limit
- Test with both HIP and CUDA backends for compatibility

### Huawei Ascend
- More experimental backend
- May require specific block sizes
- Consult Ascend-specific documentation

## Code Quality Standards

When generating TileLang code:

1. **Add explanatory comments** for non-obvious optimizations
2. **Specify hardware assumptions** (e.g., "optimized for A100")
3. **Include usage examples** showing instantiation
4. **Document block size choices** and tuning rationale
5. **Provide performance expectations** (e.g., "~90% of cuBLAS")

## Example Kernel Request Flow

**User:** "Write an FP16 matrix multiplication kernel for A100"

**Response:**
1. Clarify dimensions (if not specified)
2. Generate complete kernel code with:
   - Proper structure (@tilelang.jit, @T.prim_func)
   - Swizzle layouts
   - 3-stage pipelining
   - Appropriate block sizes (128×128×32)
3. Add usage example
4. Explain key optimizations:
   - "Swizzle layout eliminates bank conflicts"
   - "3-stage pipeline overlaps memory and compute"
   - "Float32 accumulator prevents overflow and precision loss"
5. Suggest testing approach

## Troubleshooting Quick Reference

**Compilation errors:**
- Shared memory exceeded → Reduce block size or num_stages
- Shape mismatch → Verify dimension alignment in T.gemm

**Runtime errors:**
- Results all zeros → Check T.clear() and out_idx in decorator
- NaN/Inf → Use float32 accumulator, add epsilon in division

**Performance issues:**
- Low throughput → Verify swizzle layout and pipelining enabled
- Low occupancy → Reduce shared memory usage or block size
- Bank conflicts → Apply T.make_swizzled_layout

For detailed solutions, consult [DEBUGGING.md](references/DEBUGGING.md).
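To triage the runtime symptoms above quickly, a small helper in plain PyTorch (illustrative only, not part of TileLang) can distinguish the common failure modes before turning to DEBUGGING.md:

```python
import torch

def triage_output(out, ref, rtol=1e-3, atol=1e-3):
    """Classify common failure modes of a kernel's output (illustrative helper)."""
    if torch.isnan(out).any() or torch.isinf(out).any():
        print("NaN/Inf detected -> use a float32 accumulator, add epsilon to divisions")
    elif torch.count_nonzero(out) == 0:
        print("All zeros -> check T.clear() placement and out_idx in @tilelang.jit")
    else:
        max_err = (out.float() - ref.float()).abs().max().item()
        print(f"Max abs error vs reference: {max_err:.4e}")
        torch.testing.assert_close(out, ref, rtol=rtol, atol=atol)
```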