---
name: debug:pytorch
description: Debug PyTorch issues systematically. Use when encountering tensor errors, CUDA out of memory errors, gradient problems like NaN loss or exploding gradients, shape mismatches between layers, device conflicts between CPU and GPU, autograd graph issues, DataLoader problems, dtype mismatches, or training instabilities in deep learning workflows.
---

# PyTorch Debugging Guide

This guide provides systematic approaches to debugging PyTorch models, from common tensor errors to complex training issues.

## Common Error Patterns

### 1. CUDA Out of Memory (OOM)

**Error Message:**
```
RuntimeError: CUDA out of memory. Tried to allocate X.XX GiB
```

**Causes:**
- Batch size too large for GPU memory
- Accumulating gradients without clearing
- Storing tensors on GPU unnecessarily
- Memory leaks from not detaching tensors

**Solutions:**
```python
# Check current memory usage
print(torch.cuda.memory_summary(device=None, abbreviated=False))
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

# Release cached blocks (does not free tensors that are still referenced)
torch.cuda.empty_cache()

# Reduce batch size
batch_size = batch_size // 2

# Use gradient checkpointing for large models
from torch.utils.checkpoint import checkpoint
output = checkpoint(self.heavy_layer, input, use_reentrant=False)

# Use mixed precision training
# (newer releases prefer torch.amp.autocast("cuda") and torch.amp.GradScaler("cuda"))
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
with autocast():
    output = model(input)
    loss = criterion(output, target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

# Store a plain Python float for logging, not the graph-attached tensor
logged_loss = loss.item()

# Use gradient accumulation instead of large batches
accumulation_steps = 4
for i, (inputs, labels) in enumerate(dataloader):
    outputs = model(inputs)
    loss = criterion(outputs, labels) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

### 2. Tensor Size/Shape Mismatch

**Error Message:**
```
RuntimeError: size mismatch, m1: [32 x 512], m2: [256 x 10]
RuntimeError: The size of tensor a (64) must match the size of tensor b (32)
```

**Causes:**
- Incorrect layer dimensions
- Wrong tensor reshaping
- Mismatched batch sizes
- Incorrect input preprocessing

**Solutions:**
```python
# Debug by printing shapes at each layer
class DebugModel(nn.Module):
    def forward(self, x):
        print(f"Input shape: {x.shape}")
        x = self.layer1(x)
        print(f"After layer1: {x.shape}")
        x = self.layer2(x)
        print(f"After layer2: {x.shape}")
        return x

# Add shape assertions as contracts
def forward(self, x):
    assert x.dim() == 4, f"Expected 4D input, got {x.dim()}D"
    assert x.shape[1] == 3, f"Expected 3 channels, got {x.shape[1]}"
    # ... rest of forward pass

# Use einops for clearer reshaping
from einops import rearrange
x = rearrange(x, 'b c h w -> b (c h w)')

# Calculate dimensions programmatically
def _get_conv_output_size(self, shape):
    with torch.no_grad():
        dummy = torch.zeros(1, *shape)
        output = self.conv_layers(dummy)
        return output.numel()
```

### 3. NaN in Gradients/Loss

**Error Message:**
```
Loss is nan
RuntimeError: Function 'XXXBackward' returned nan values
```

**Causes:**
- Learning rate too high
- Numerical instability in operations
- Division by zero
- Log of zero or negative numbers
- Exploding gradients

**Solutions:**
```python
# Enable anomaly detection to find the source
torch.autograd.set_detect_anomaly(True)

# Check for NaN in tensors
def check_nan(tensor, name="tensor"):
    if torch.isnan(tensor).any():
        print(f"NaN detected in {name}")
        print(f"Shape: {tensor.shape}")
        print(f"NaN count: {torch.isnan(tensor).sum()}")
        raise ValueError(f"NaN in {name}")

# Add epsilon for numerical stability
eps = 1e-8
log_probs = torch.log(probs + eps)
normalized = x / (x.norm() + eps)

# Gradient clipping (call after backward() and before optimizer.step();
# pick either norm-based or value-based clipping)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)

# Check gradients after backward
def check_gradients(model):
    for name, param in model.named_parameters():
        if param.grad is not None:
            if torch.isnan(param.grad).any():
                print(f"NaN gradient in {name}")
            if torch.isinf(param.grad).any():
                print(f"Inf gradient in {name}")
            grad_norm = param.grad.norm()
            print(f"{name}: grad_norm = {grad_norm:.4f}")

# Use stable loss functions
# BAD: nn.CrossEntropyLoss on softmax output
# GOOD: nn.CrossEntropyLoss on logits (raw scores)
loss = nn.CrossEntropyLoss()(logits, targets)  # Not softmax(logits)
```

### 4. Device Mismatch (CPU/GPU)

**Error Message:**
```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
```

**Causes:**
- Model on GPU, data on CPU (or vice versa)
- Loading model saved on different device
- Creating new tensors without specifying device
- Mixing tensors from different GPUs

**Solutions:**
```python
# Always explicitly move to device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
inputs = inputs.to(device)
targets = targets.to(device)

# Load model with map_location
model.load_state_dict(torch.load('model.pt', map_location=device))

# Create tensors on the correct device
new_tensor = torch.zeros(10, device=device)
new_tensor = torch.zeros_like(existing_tensor)  # Same device as existing

# Check device of all model parameters
def check_model_device(model):
    devices = {p.device for p in model.parameters()}
    print(f"Model parameters on devices: {devices}")

# Debug device issues
print(f"Model device: {next(model.parameters()).device}")
print(f"Input device: {inputs.device}")
print(f"Target device: {targets.device}")
```

### 5. Autograd Graph Issues

**Error Message:**
```
RuntimeError: Trying to backward through the graph a second time
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
```

**Causes:**
- Calling backward() twice without retain_graph
- In-place operations on tensors requiring gradients
- Detaching tensors incorrectly
- Not enabling gradients on input tensors

**Solutions:**
```python
# For multiple backward passes
loss.backward(retain_graph=True)  # Use sparingly - memory intensive

# Avoid in-place operations on tensors with gradients
# BAD:
x += 1
x[0] = 0
x.add_(1)

# GOOD:
x = x + 1
x = x.clone()
x[0] = 0

# Ensure requires_grad is set
input_tensor = torch.randn(10, requires_grad=True)

# Clone before in-place modification
x_modified = x.clone()
x_modified[0] = 0

# Check if tensor has gradient function
print(f"requires_grad: {tensor.requires_grad}")
print(f"grad_fn: {tensor.grad_fn}")
print(f"is_leaf: {tensor.is_leaf}")

# Properly detach for logging/storage
logged_value = tensor.detach().cpu().numpy()
```

### 6. DataLoader Problems

**Error Message:**
```
RuntimeError: DataLoader worker (pid X) is killed
BrokenPipeError: [Errno 32] Broken pipe
RuntimeError: Cannot re-initialize CUDA in forked subprocess
```

**Causes:**
- Too many workers
- Memory issues in workers
- CUDA operations before DataLoader fork
- Shared memory issues

**Solutions:**
```python
# Reduce number of workers
dataloader = DataLoader(dataset, batch_size=32, num_workers=2)

# Use spawn instead of fork for CUDA
# (call once, at program start, under `if __name__ == "__main__":`)
import multiprocessing
multiprocessing.set_start_method('spawn', force=True)

# Pin memory for faster GPU transfer (uses page-locked host memory)
dataloader = DataLoader(dataset, pin_memory=True)

# Increase shared memory for Docker
# In docker-compose.yml: shm_size: '2gb'

# Debug DataLoader issues
dataloader = DataLoader(dataset, num_workers=0)  # Single process for debugging

# Use persistent workers to avoid respawning
dataloader = DataLoader(dataset, num_workers=4, persistent_workers=True)

# Custom collate function with error handling
def safe_collate(batch):
    try:
        return torch.utils.data.dataloader.default_collate(batch)
    except Exception as e:
        print(f"Collate error: {e}")
        print(f"Batch: {batch}")
        raise
```

### 7. Dtype Mismatch

**Error Message:**
```
RuntimeError: expected scalar type Float but found Double
RuntimeError: expected scalar type Long but found Int
```

**Causes:**
- Mixing float32 and float64
- Wrong dtype for loss functions
- NumPy default dtype conflicts

**Solutions:**
```python
# Explicitly set dtype
tensor = torch.tensor(data, dtype=torch.float32)
tensor = tensor.float()  # Convert to float32
tensor = tensor.long()   # Convert to int64

# Set default dtype globally
torch.set_default_dtype(torch.float32)

# CrossEntropyLoss expects Long targets (class indices)
targets = targets.long()

# BCELoss expects Float targets
targets = targets.float()

# Convert NumPy arrays properly
numpy_array = np.array([1.0, 2.0])  # float64 by default
tensor = torch.from_numpy(numpy_array).float()  # Convert to float32
```

## Debugging Tools

### 1. Anomaly Detection

```python
# Enable for debugging (disable in production - slow)
torch.autograd.set_detect_anomaly(True)

# Use as context manager
with torch.autograd.detect_anomaly():
    output = model(input)
    loss = criterion(output, target)
    loss.backward()
```

### 2. Python Debugger

```python
# Insert breakpoint
breakpoint()  # Python 3.7+
import pdb; pdb.set_trace()  # Older Python

# In Jupyter
from IPython.core.debugger import set_trace
set_trace()

# Common pdb commands:
# n - next line
# s - step into
# c - continue
# p variable - print variable
# pp variable - pretty print
# l - list source code
# q - quit
```

### 3. Memory Profiling

```python
# GPU memory summary
print(torch.cuda.memory_summary())

# Memory snapshots (private API, may change between releases)
torch.cuda.memory._record_memory_history()
# ... run code ...
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")

# Profile memory allocation
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             profile_memory=True) as prof:
    model(input)

print(prof.key_averages().table(sort_by="self_cuda_memory_usage"))
```

### 4. TensorBoard Integration

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('runs/experiment_1')

# Log scalars
writer.add_scalar('Loss/train', loss.item(), epoch)
writer.add_scalar('Accuracy/val', accuracy, epoch)

# Log histograms of weights and gradients
for name, param in model.named_parameters():
    writer.add_histogram(f'weights/{name}', param, epoch)
    if param.grad is not None:
        writer.add_histogram(f'gradients/{name}', param.grad, epoch)

# Log model graph
writer.add_graph(model, input_tensor)

# Log images
writer.add_images('predictions', predicted_images, epoch)

writer.close()

# Launch: tensorboard --logdir=runs
```

### 5. PyTorch Profiler

```python
from torch.profiler import profile, record_function, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    with record_function("model_inference"):
        model(input)

# Print summary
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

# Export for Chrome trace viewer
prof.export_chrome_trace("trace.json")

# Export stack traces for flame-graph tools (requires with_stack=True)
prof.export_stacks("profiler_stacks.txt", "self_cuda_time_total")
```

## The Four Phases of PyTorch Debugging

### Phase 1: Reproduce and Isolate

```python
import random

import numpy as np
import torch

# Set seeds for reproducibility
def set_seed(seed=42):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)

# Create minimal reproducible example
# Isolate the problematic component
# Test with synthetic data first
dummy_input = torch.randn(2, 3, 224, 224, device=device)
dummy_target = torch.randint(0, 10, (2,), device=device)

# Run single forward/backward pass
model.train()
output = model(dummy_input)
loss = criterion(output, dummy_target)
loss.backward()
```

### Phase 2: Validate Data Pipeline

```python
# Check dataset
print(f"Dataset size: {len(dataset)}")
sample = dataset[0]
print(f"Sample type: {type(sample)}")
print(f"Sample shapes: {[s.shape if hasattr(s, 'shape') else type(s) for s in sample]}")

# Check DataLoader output
batch = next(iter(dataloader))
for i, item in enumerate(batch):
    print(f"Batch item {i}: shape={item.shape}, dtype={item.dtype}, device={item.device}")

# Validate data ranges
inputs, targets = batch
print(f"Input range: [{inputs.min():.4f}, {inputs.max():.4f}]")
print(f"Input mean: {inputs.mean():.4f}, std: {inputs.std():.4f}")
print(f"Target unique values: {targets.unique()}")

# Check for data issues
assert not torch.isnan(inputs).any(), "NaN in inputs"
assert not torch.isinf(inputs).any(), "Inf in inputs"
```

### Phase 3: Validate Model Architecture

```python
# Test with tiny data to check for bugs
def overfit_single_batch(model, batch, criterion, epochs=100):
    """Model should be able to overfit a single batch."""
    model.train()
    inputs, targets = batch
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(epochs):
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        if epoch % 10 == 0:
            print(f"Epoch {epoch}: Loss = {loss.item():.6f}")

    # Loss should be very low if model can learn
    assert loss.item() < 0.1, "Model cannot overfit single batch!"

# Check model parameters
def inspect_model(model):
    total_params = 0
    trainable_params = 0
    for name, param in model.named_parameters():
        total_params += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
        print(f"{name}: shape={param.shape}, requires_grad={param.requires_grad}")
    print(f"\nTotal params: {total_params:,}")
    print(f"Trainable params: {trainable_params:,}")

# Verify forward pass shape transformations
def trace_shapes(model, input_shape):
    """Trace shapes through the model."""
    hooks = []
    shapes = []

    def hook(module, input, output):
        shapes.append({
            'module': module.__class__.__name__,
            'input': [i.shape for i in input if hasattr(i, 'shape')],
            'output': output.shape if hasattr(output, 'shape') else type(output)
        })

    for layer in model.modules():
        hooks.append(layer.register_forward_hook(hook))

    dummy = torch.randn(1, *input_shape)
    model(dummy)

    for h in hooks:
        h.remove()

    for s in shapes:
        print(s)
```

### Phase 4: Validate Training Loop

```python
# Comprehensive training loop debugging
def debug_training_step(model, batch, criterion, optimizer):
    model.train()
    inputs, targets = batch

    # Check inputs
    print(f"Input shape: {inputs.shape}, dtype: {inputs.dtype}")
    print(f"Target shape: {targets.shape}, dtype: {targets.dtype}")

    # Zero gradients
    optimizer.zero_grad()

    # Forward and backward pass under anomaly detection
    with torch.autograd.detect_anomaly():
        outputs = model(inputs)
        print(f"Output shape: {outputs.shape}")
        print(f"Output range: [{outputs.min():.4f}, {outputs.max():.4f}]")

        # Check for NaN in outputs
        if torch.isnan(outputs).any():
            print("WARNING: NaN in outputs!")

        # Compute loss
        loss = criterion(outputs, targets)
        print(f"Loss: {loss.item():.6f}")

        if torch.isnan(loss):
            print("WARNING: NaN loss!")
            return None

        # Backward pass
        loss.backward()

    # Check gradients
    for name, param in model.named_parameters():
        if param.grad is not None:
            grad_norm = param.grad.norm().item()
            print(f"{name}: grad_norm = {grad_norm:.6f}")
            if torch.isnan(param.grad).any():
                print(f"  WARNING: NaN gradient in {name}!")

    # Optimizer step
    optimizer.step()

    return loss.item()
```

## Quick Reference Commands

### Environment and Device Checks

```python
# PyTorch version
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"cuDNN version: {torch.backends.cudnn.version()}")

# GPU information
if torch.cuda.is_available():
    print(f"GPU count: {torch.cuda.device_count()}")
    print(f"Current GPU: {torch.cuda.current_device()}")
    print(f"GPU name: {torch.cuda.get_device_name()}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
```

### Gradient Checking

```python
# Numerical gradient check
from torch.autograd import gradcheck

# For custom autograd functions (gradcheck needs double precision inputs)
input = torch.randn(10, requires_grad=True, dtype=torch.double)
test = gradcheck(my_function, (input,), eps=1e-6, atol=1e-4)
print(f"Gradient check passed: {test}")

# For modules
def check_module_gradients(module, input_shape):
    module = module.double()
    input = torch.randn(*input_shape, requires_grad=True, dtype=torch.double)
    return gradcheck(module, (input,), eps=1e-6, atol=1e-4)
```

### Memory Management

```python
# Clear GPU memory
torch.cuda.empty_cache()

# Memory stats
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")
print(f"Max allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")

# Reset peak stats
torch.cuda.reset_peak_memory_stats()

# When hunting leaks: free unreferenced objects first, then re-check the stats
import gc
gc.collect()
torch.cuda.empty_cache()
```

### Model Inspection

```python
# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total: {total_params:,}, Trainable: {trainable_params:,}")

# Model summary (requires torchsummary)
from torchsummary import summary
summary(model, input_size=(3, 224, 224))

# Export model to ONNX for visualization
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=11)
```

### Tensor Debugging

```python
# Comprehensive tensor info
def tensor_info(t, name="tensor"):
    stats = t.detach().float()  # float copy so mean/std also work on integer tensors
    print(f"{name}:")
    print(f"  shape: {t.shape}")
    print(f"  dtype: {t.dtype}")
    print(f"  device: {t.device}")
    print(f"  requires_grad: {t.requires_grad}")
    print(f"  is_leaf: {t.is_leaf}")
    print(f"  grad_fn: {t.grad_fn}")
    print(f"  min: {stats.min().item():.6f}")
    print(f"  max: {stats.max().item():.6f}")
    print(f"  mean: {stats.mean().item():.6f}")
    print(f"  std: {stats.std().item():.6f}")
    print(f"  has_nan: {torch.isnan(stats).any().item()}")
    print(f"  has_inf: {torch.isinf(stats).any().item()}")
```

## Best Practices Summary

1. **Always use explicit device placement** - Never assume tensors are on the right device
2. **Use loss functions on logits** - CrossEntropyLoss expects raw scores, not softmax output
3. **Register modules properly** - Use nn.ModuleList/ModuleDict for parameter detection
4. **Avoid in-place operations** - They can break autograd graphs
5. **Set seeds for reproducibility** - Makes debugging deterministic
6. **Start with small data** - Test overfitting on a single batch first
7. **Use anomaly detection** - Enable during development, disable in production
8. **Monitor gradients** - Check for NaN, Inf, and exploding/vanishing gradients
9. **Profile memory** - Use memory_summary() to identify leaks
10. **Print shapes liberally** - Shape mismatches are among the most common bugs

## Resources

- [UvA Deep Learning Notebooks - Debugging Guide](https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/guide3/Debugging_PyTorch.html)
- [PyTorch Lightning Debugging Guide](https://lightning.ai/docs/pytorch/stable/debug/debugging_basic.html)
- [PyTorch Performance Tuning Guide](https://docs.pytorch.org/tutorials/recipes/recipes/tuning_guide.html)
- [TorchRL Common Errors and Solutions](https://docs.pytorch.org/rl/stable/reference/generated/knowledge_base/PRO-TIPS.html)
- [Machine Learning Mastery - Debugging PyTorch](https://machinelearningmastery.com/debugging-pytorch-machine-learning-models-a-step-by-step-guide/)
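
Best practice 3 (register modules properly) is the one item in the summary without a snippet elsewhere in this guide. The following is a minimal sketch of the failure mode, not a definitive recipe: `PlainListNet`, `ModuleListNet`, and `count_params` are hypothetical names invented for illustration. Layers kept in a plain Python list are not registered with the parent module, so `parameters()`, `state_dict()`, and `.to(device)` skip them, and an optimizer built from `model.parameters()` would never update them.

```python
import torch
from torch import nn


class PlainListNet(nn.Module):
    """Hypothetical model with the bug: layers live in a plain Python list."""

    def __init__(self):
        super().__init__()
        # A plain list is an ordinary attribute, so these layers are NOT registered.
        self.layers = [nn.Linear(4, 4) for _ in range(3)]

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


class ModuleListNet(nn.Module):
    """Same hypothetical model, with the layers registered via nn.ModuleList."""

    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


def count_params(model):
    """Number of parameter elements visible to optimizers via parameters()."""
    return sum(p.numel() for p in model.parameters())


broken = PlainListNet()
fixed = ModuleListNet()

print(f"Plain list: {count_params(broken)} parameters visible")
print(f"ModuleList: {count_params(fixed)} parameters visible")
```

If a parameter count from `inspect_model` looks suspiciously small, or part of a model stays on the CPU after `model.to(device)`, an unregistered container like this is one plausible cause worth checking.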