---
name: gpu-optimizer
description: Expert GPU optimization for modern consumer GPUs (8-24GB VRAM). Use this skill when you need to optimize GPU training, speed up CUDA code, reduce OOM errors, tune XGBoost for GPU, migrate NumPy to CuPy, make a model faster, manage GPU memory, optimize VRAM usage, or benchmark PyTorch. Covers mixed precision, gradient checkpointing, XGBoost GPU acceleration, CuPy/cuDF migration, vectorization, torch.compile, and diagnostics. NVIDIA GPUs only. PyTorch, XGBoost, and RAPIDS frameworks.
metadata:
  version: 1.1.0
---

# GPU Optimizer

Expert GPU optimization for consumer GPUs with 8–24GB VRAM. Evidence-based patterns only.

## Hardware Profile

Fill in your hardware before applying optimizations:

| Property          | Your Value                                       |
| ----------------- | ------------------------------------------------ |
| GPU model         | (e.g., RTX 4080 Mobile, RTX 3090, RTX 4090)      |
| VRAM              | (e.g., 12GB, 16GB, 24GB)                         |
| CUDA version      | (`nvidia-smi` → top-right)                       |
| TDP / power limit | (laptop vs desktop affects sustained throughput) |
| Driver version    | (`nvidia-smi` → top-left)                        |

Key constraint: VRAM capacity determines which strategies apply. Patterns below are annotated with minimum VRAM requirements where relevant.

## Optimization Categories

### 1. XGBoost GPU Acceleration

**DMatrix vs QuantileDMatrix:**

```python
# GPU-optimized: QuantileDMatrix is 1.8x faster
dtrain = xgb.QuantileDMatrix(X_train.astype(np.float32))
dval = xgb.QuantileDMatrix(X_val.astype(np.float32))

# Standard: DMatrix (use for inference only)
dtest = xgb.DMatrix(X_test.astype(np.float32))
```

**Critical Parameters:**

```python
params = {
    'tree_method': 'hist',        # GPU-accelerated histogram
    'device': 'cuda:0',           # Explicit GPU device
    'max_bin': 256,               # Higher bins = better splits (VRAM permitting)
    'grow_policy': 'depthwise',   # vs 'lossguide' for imbalanced data
    'predictor': 'gpu_predictor', # GPU inference
}

# Training with explicit device
model = xgb.train(params, dtrain, num_boost_round=100)
```

**GPU Verification (fail-fast):**

```python
def verify_gpu():
    """Verify XGBoost GPU availability. Raises if unavailable."""
    import subprocess
    try:
        result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError("nvidia-smi failed - no GPU available")
    except FileNotFoundError:
        raise RuntimeError("nvidia-smi not found - no GPU available")

    build_info = xgb.build_info()
    if not build_info.get("USE_CUDA"):
        raise RuntimeError("XGBoost not compiled with CUDA support")
```

**Memory Management:**

```python
# Single-pass training (reuse QuantileDMatrix across slots)
dtrain = xgb.QuantileDMatrix(X_train.astype(np.float32))
for slot_idx in range(num_slots):
    dtrain.set_label(y_train[:, slot_idx])  # Reuse matrix
    model = xgb.train(params, dtrain, num_boost_round=100)
```

### 2. PyTorch Mixed Precision

**BF16 (preferred) vs FP16:**

```python
from torch.amp import autocast, GradScaler

# Auto-detect best precision
if torch.cuda.is_bf16_supported():
    amp_dtype = torch.bfloat16  # Ampere+ GPUs support BF16
else:
    amp_dtype = torch.float16

# Training step
scaler = GradScaler('cuda') if amp_dtype == torch.float16 else None

with autocast('cuda', dtype=amp_dtype):
    output = model(input_ids, attention_mask)
    loss = criterion(output, targets)

# Backward with scaling (FP16 only)
if scaler:
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
else:
    loss.backward()
    optimizer.step()
```

**Why BF16 > FP16:**

- Same exponent range as FP32 (no overflow/underflow)
- No GradScaler needed (simpler code)
- Ampere and later GPUs have native BF16 Tensor cores

### 3. VRAM Management

**Gradient Checkpointing:**

```python
# Saves ~40% VRAM, adds ~20% compute time
model.gradient_checkpointing_enable()

# For transformers:
model.base_model.model.gradient_checkpointing_enable()
```

**VRAM Monitoring:**

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... training ...
peak_vram_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM: {peak_vram_gb:.2f} GB")

# Clear cache between experiments
torch.cuda.empty_cache()
```

**Gradient Accumulation:**

```python
# Simulate larger batch size without OOM
grad_accum_steps = max(1, target_batch_size // actual_batch_size)

for i, batch in enumerate(dataloader):
    loss = model(batch) / grad_accum_steps
    loss.backward()

    if (i + 1) % grad_accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

**DoE for VRAM Optimization:**

```python
EXPERIMENTS = [
    {"batch_size": 2,  "seq_len": 128, "grad_ckpt": True,  "amp": "bf16"},
    {"batch_size": 4,  "seq_len": 256, "grad_ckpt": True,  "amp": "bf16"},
    {"batch_size": 8,  "seq_len": 512, "grad_ckpt": False, "amp": "bf16"},
    {"batch_size": 16, "seq_len": 256, "grad_ckpt": False, "amp": "bf16"},
]
```

### 4. Aggressive Vectorization

**Tensor Lookups (not Python loops):**

```python
# Slow: Python loop
for i, token_id in enumerate(input_ids):
    type_id = token_to_type[token_id]
    embeddings[i] = type_embeddings[type_id]

# Fast: Vectorized
type_ids = token_to_type[input_ids]  # Broadcast lookup
embeddings = type_embeddings[type_ids]  # Single GPU kernel
```

**Registered Buffers (persistent GPU data):**

```python
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        # Build lookup tensors once
        type_ids = torch.zeros(vocab_size, dtype=torch.long)
        self.register_buffer('_type_ids', type_ids)  # Stays on GPU

    def forward(self, input_ids):
        return self._type_ids[input_ids]  # Vectorized lookup
```

**Batch Operations:**

```python
# Slow: Per-sample processing
outputs = [model(x.unsqueeze(0)) for x in batch]

# Fast: Batched
outputs = model(batch)  # Single forward pass
```

### 5. CuPy Migration (NumPy → GPU)

**When to Use CuPy:**

- Large array operations (>1M elements)
- Repeated NumPy calls in tight loops
- Preprocessing pipelines before PyTorch/XGBoost

**Migration Pattern:**

```python
import cupy as cp
import numpy as np

# NumPy (CPU)
x = np.random.randn(10000, 1000)
y = np.dot(x, x.T)

# CuPy (GPU) - SAME API
x_gpu = cp.random.randn(10000, 1000)
y_gpu = cp.dot(x_gpu, x_gpu.T)

# Transfer back if needed
y_cpu = cp.asnumpy(y_gpu)
```

**Interop with PyTorch:**

```python
# CuPy → PyTorch (zero-copy)
x_cupy = cp.random.randn(1000, 1000)
x_torch = torch.as_tensor(x_cupy, device='cuda')

# PyTorch → CuPy (zero-copy)
x_torch = torch.randn(1000, 1000, device='cuda')
x_cupy = cp.asarray(x_torch)
```

**Install:**

```bash
uv pip install cupy-cuda12x  # For CUDA 12.x
```

### 6. cuDF Migration (Pandas → GPU)

**When to Use cuDF:**

- DataFrames >1GB
- Groupby/aggregation on large data
- ETL pipelines before model training

**Migration Pattern:**

```python
import cudf
import pandas as pd

# Pandas (CPU)
df = pd.read_csv('large.csv')
grouped = df.groupby('category')['value'].mean()

# cuDF (GPU) - SAME API
df_gpu = cudf.read_csv('large.csv')
grouped_gpu = df_gpu.groupby('category')['value'].mean()

# Transfer back
grouped_cpu = grouped_gpu.to_pandas()
```

**XGBoost Integration:**

```python
import cudf
import xgboost as xgb

# Load data on GPU
df = cudf.read_csv('train.csv')
X = df[feature_cols]
y = df['target']

# Create DMatrix directly from cuDF (no CPU copy)
dtrain = xgb.DMatrix(X, label=y)
```

**Install:**

```bash
# RAPIDS (includes cuDF, cuML, cuGraph)
uv pip install cudf-cu12 --extra-index-url=https://pypi.nvidia.com
```

### 7. PyTorch Compilation & Optimization

**Fused Optimizer:**

```python
# Check availability
use_fused = (
    torch.cuda.is_available()
    and "fused" in torch.optim.AdamW.__init__.__code__.co_varnames
)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    fused=use_fused,  # Single GPU kernel (2-3x faster)
)
```

**Torch Compile:**

```python
# PyTorch 2.0+ compile
if hasattr(torch, "compile"):
    model = torch.compile(model, mode="reduce-overhead")
```

**cuDNN Benchmarking:**

```python
# Auto-tune kernels (slower startup, faster training)
torch.backends.cudnn.benchmark = True

# Disable for determinism
torch.backends.cudnn.deterministic = True
```

### 8. Advanced Loss Functions

**Weighted Slot Loss:**

```python
class WeightedSlotLoss(nn.Module):
    def __init__(self, slot_weights):
        super().__init__()
        self.slot_weights = torch.tensor(slot_weights)

    def forward(self, logits_list, targets):
        weighted_losses = []
        for i, logits in enumerate(logits_list):
            loss = F.cross_entropy(logits, targets[:, i])
            weighted_losses.append(loss * self.slot_weights[i])
        return torch.stack(weighted_losses).sum() / self.slot_weights.sum()
```

**Focal Loss (hard example mining):**

```python
class FocalLoss(nn.Module):
    def __init__(self, gamma=2.0):
        super().__init__()
        self.gamma = gamma

    def forward(self, logits, targets):
        ce_loss = F.cross_entropy(logits, targets, reduction='none')
        pt = torch.exp(-ce_loss)
        focal_loss = ((1 - pt) ** self.gamma) * ce_loss
        return focal_loss.mean()
```

### 9. Caching & Precomputation

**Position Embedding Cache:**

```python
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self._pos_cache = {}  # {seq_len: positions}

    def forward(self, x):
        T = x.size(1)
        if T not in self._pos_cache:
            self._pos_cache[T] = torch.arange(T, device=x.device)
            # Limit cache size
            if len(self._pos_cache) > 10:
                self._pos_cache.pop(next(iter(self._pos_cache)))
        return self.pos_embed(self._pos_cache[T])
```

**Attention Mask Cache:**

```python
def _create_causal_mask(self, T, device):
    if T not in self._mask_cache:
        mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
        self._mask_cache[T] = mask.to(device)
    return self._mask_cache[T]
```

## Quick Diagnostics

**Check GPU Utilization:**

```bash
watch -n 1 nvidia-smi  # Monitor in real-time
```

**Profile PyTorch:**

```python
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.GPU],
    with_stack=True,
) as prof:
    model(batch)

print(prof.key_averages().table(sort_by="cuda_time_total"))
```

**Bottleneck Detection:**

```python
import torch.utils.bottleneck as bottleneck
bottleneck.main(['script.py'])
```

## Migration Checklist

- [ ] **XGBoost**: Use `QuantileDMatrix`, set `device='cuda:0'`
- [ ] **PyTorch**: Enable BF16/FP16, fused optimizer, torch.compile
- [ ] **VRAM**: Gradient checkpointing if approaching VRAM limit
- [ ] **NumPy→CuPy**: For preprocessing >1M elements
- [ ] **Pandas→cuDF**: For DataFrames >1GB
- [ ] **Vectorization**: Replace Python loops with tensor ops
- [ ] **Caching**: Precompute positions, masks, embeddings
- [ ] **Monitor**: Track VRAM usage, profile GPU kernels

## Anti-Patterns

**Avoid:**

- Using `.cpu()` in training loop (kills GPU pipeline)
- Creating tensors on CPU then moving to GPU (create on GPU directly)
- Using Python loops over tensors (vectorize)
- Ignoring VRAM monitoring (leads to OOM crashes)
- Using FP32 when BF16/FP16 works (wastes bandwidth)
- Calling `torch.cuda.synchronize()` unnecessarily (breaks async)

## References

**Documentation:**

- XGBoost GPU: https://xgboost.readthedocs.io/en/stable/gpu/
- PyTorch AMP: https://pytorch.org/docs/stable/amp.html
- CuPy: https://docs.cupy.dev/en/stable/
- cuDF: https://docs.rapids.ai/api/cudf/stable/

## Error Handling

- CUDA not available at runtime: run `nvidia-smi` first to confirm the GPU is visible; if the command fails, verify driver installation with `sudo nvidia-smi` or reinstall drivers before proceeding.
- XGBoost raises `RuntimeError: XGBoost not compiled with CUDA support`: install the CUDA build via `uv pip install xgboost` from a CUDA-enabled environment, or build from source with `-DUSE_CUDA=ON`.
- OOM during training: reduce batch size first (halve it), then enable gradient checkpointing; if OOM persists after both, enable gradient accumulation to simulate the original batch size.
- CuPy import failure (`ImportError` or version mismatch): verify CUDA toolkit version with `nvcc --version` and install the matching CuPy wheel (e.g., `cupy-cuda12x` for CUDA 12.x).
- cuDF install fails or produces CUDA version errors: use the NVIDIA PyPI index (`--extra-index-url=https://pypi.nvidia.com`) and match the `cudf-cu12` suffix to your CUDA major version.
- `torch.compile` produces incorrect results or crashes: disable with `model = model` (no compile) to isolate; known to fail on some custom ops — fall back to eager mode for those layers.

## Limitations

- NVIDIA GPUs only — AMD (ROCm) and Intel Arc GPUs are not covered by these patterns.
- Assumes a single-GPU setup; multi-GPU (DDP, FSDP) requires additional configuration not covered here.
- Patterns are calibrated for consumer GPUs (8–24GB VRAM); datacenter GPUs (A100, H100) have different memory hierarchies and may benefit from different strategies.
- Framework coverage: PyTorch, XGBoost, and RAPIDS (CuPy/cuDF) only — JAX, TensorFlow, and MXNet are out of scope.
- Laptop GPU TDP limits sustained throughput; power-throttled performance can differ significantly from desktop benchmarks even at the same VRAM capacity.

## Output Format

Each optimization recommendation includes a before/after code pair showing the original pattern and the GPU-optimized equivalent.
Performance gain estimates are provided as ranges (e.g., "1.8x faster", "~40% VRAM reduction") based on typical consumer GPU benchmarks — actual gains depend on workload and hardware.
Where a change introduces a trade-off (e.g., gradient checkpointing adds compute time), the trade-off is stated explicitly inline.