---
name: debug:tensorflow
description: Debug TensorFlow and Keras issues systematically. This skill helps diagnose and resolve machine learning problems including tensor shape mismatches, GPU/CUDA detection failures, out-of-memory errors, NaN/Inf values in loss functions, vanishing/exploding gradients, SavedModel loading errors, and data pipeline bottlenecks. Provides tf.debugging assertions, TensorBoard profiling, eager execution debugging, and version compatibility guidance.
---

# TensorFlow Debugging Guide

This skill provides a systematic approach to debugging TensorFlow applications, covering common error patterns, debugging tools, and resolution strategies.

## Common Error Patterns

### 1. Shape Mismatch Errors

**Symptoms:**
- `InvalidArgumentError: Incompatible shapes`
- `ValueError: Shapes (X,) and (Y,) are incompatible`
- Matrix multiplication failures

**Diagnostic Steps:**

```python
# Print shapes at key points
print(f"Input shape: {x.shape}")
print(f"Expected shape: {model.input_shape}")

# Use tf.debugging for assertions
tf.debugging.assert_shapes([
    (x, ('batch', 'features')),
    (y, ('batch', 'classes'))
])

# Run tf.function code eagerly for immediate shape inspection
tf.config.run_functions_eagerly(True)
```

**Common Causes:**
- Batch dimension mismatch (missing or extra dimension)
- Incorrect reshape operations
- Mismatched layer input/output dimensions
- Broadcasting issues with incompatible shapes

**Solutions:**

```python
# Expand dimensions if needed
x = tf.expand_dims(x, axis=0)  # Add batch dimension

# Reshape explicitly
x = tf.reshape(x, [-1, height, width, channels])

# Use tf.ensure_shape for runtime validation
x = tf.ensure_shape(x, [None, 224, 224, 3])
```

### 2. OOM (Out of Memory) Errors

**Symptoms:**
- `ResourceExhaustedError: OOM when allocating tensor`
- `CUDA_ERROR_OUT_OF_MEMORY`
- Training crashes after a few epochs

**Diagnostic Steps:**

```python
# Check which GPUs are visible and their details
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    for gpu in gpus:
        details = tf.config.experimental.get_device_details(gpu)
        print(f"GPU: {gpu.name}, Details: {details}")

# Dump per-tensor health info (viewable in TensorBoard Debugger V2)
tf.debugging.experimental.enable_dump_debug_info(
    '/tmp/tfdbg2_logdir',
    tensor_debug_mode='FULL_HEALTH',
    circular_buffer_size=1000
)
```

**Solutions:**

```python
# Enable memory growth (prevent TF from allocating all GPU memory)
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

# Limit GPU memory
tf.config.set_logical_device_configuration(
    gpus[0],
    [tf.config.LogicalDeviceConfiguration(memory_limit=4096)]  # 4GB
)

# Reduce batch size
BATCH_SIZE = 16  # Try smaller values

# Use gradient checkpointing for large models
# (recompute activations during the backward pass; see the sketch below)

# Clear session between runs
tf.keras.backend.clear_session()

# Use mixed precision training
tf.keras.mixed_precision.set_global_policy('mixed_float16')
```
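
The gradient-checkpointing comment above is only a pointer. A minimal sketch using `tf.recompute_grad` might look like the following; the block sizes and the names `dense_block` and `checkpointed_block` are illustrative, and `tf.recompute_grad` has caveats around variable creation, so the block is built before wrapping:

```python
import tensorflow as tf

# Hypothetical costly sub-network whose activations dominate memory use
dense_block = tf.keras.Sequential([
    tf.keras.layers.Dense(4096, activation='relu'),
    tf.keras.layers.Dense(4096, activation='relu'),
])
dense_block.build((None, 512))  # Create variables before wrapping

# Wrap the forward pass so activations are recomputed during backprop
# instead of being stored, trading compute for memory
checkpointed_block = tf.recompute_grad(lambda t: dense_block(t))

x = tf.random.normal((32, 512))
with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(checkpointed_block(x)))
grads = tape.gradient(loss, dense_block.trainable_variables)
```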

### 3. NaN/Inf in Loss

**Symptoms:**
- Loss becomes `nan` or `inf` during training
- Model predictions are all NaN
- Gradient norm explodes

**Diagnostic Steps:**

```python
# Enable numeric checking
tf.debugging.enable_check_numerics()

# Check for NaN in tensors
tf.debugging.check_numerics(tensor, "Tensor contains NaN or Inf")

# Use TensorBoard Debugger V2
# (the first argument is the dump root directory)
tf.debugging.experimental.enable_dump_debug_info(
    '/tmp/tfdbg2_logdir',
    tensor_debug_mode='FULL_HEALTH',
    circular_buffer_size=1000
)
```

**Common Causes:**
- Learning rate too high
- Exploding gradients
- Log of zero or negative numbers
- Division by zero
- Incorrect loss function for the data range

**Solutions:**

```python
# Reduce the learning rate
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)

# Add gradient clipping
optimizer = tf.keras.optimizers.Adam(clipnorm=1.0)
# or
optimizer = tf.keras.optimizers.Adam(clipvalue=0.5)

# Use numerically stable operations
# Instead of: tf.math.log(x)
tf.math.log(x + 1e-7)  # Add epsilon
# Instead of: x / y
tf.math.divide_no_nan(x, y)

# Add batch normalization
model.add(tf.keras.layers.BatchNormalization())

# Check data for NaN before training
assert not tf.reduce_any(tf.math.is_nan(train_data)).numpy()
```

### 4. Gradient Issues

**Symptoms:**
- Vanishing gradients (weights not updating)
- Exploding gradients (loss becomes NaN)
- Training stalls; loss doesn't decrease

**Diagnostic Steps:**

```python
# Inspect gradients with GradientTape
with tf.GradientTape() as tape:
    predictions = model(x, training=True)
    loss = loss_fn(y, predictions)

gradients = tape.gradient(loss, model.trainable_variables)
for var, grad in zip(model.trainable_variables, gradients):
    if grad is not None:
        print(f"{var.name}: grad_norm={tf.norm(grad).numpy():.6f}")
    else:
        print(f"{var.name}: NO GRADIENT (disconnected)")

# Check for dead ReLUs by probing an intermediate layer
# (model.layers[5].output alone is symbolic and cannot be evaluated directly)
probe = tf.keras.Model(model.input, model.layers[5].output)
activations = probe(x)
dead_fraction = tf.reduce_mean(tf.cast(activations <= 0, tf.float32))
```

**Solutions:**

```python
# For vanishing gradients:
# Use He initialization for ReLU networks
initializer = tf.keras.initializers.HeNormal()

# Use LeakyReLU instead of ReLU
model.add(tf.keras.layers.LeakyReLU(alpha=0.1))

# Add residual connections (skip connections); see the sketch below

# For exploding gradients:
# Apply gradient clipping
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)

# Use proper weight initialization
initializer = tf.keras.initializers.GlorotUniform()
```
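
The residual-connection line above is only a comment. Here is a minimal functional-API sketch of a skip connection, which gives gradients a direct path around the nonlinear layers (the layer widths are illustrative):

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(64,))
h = tf.keras.layers.Dense(64, activation='relu',
                          kernel_initializer='he_normal')(inputs)
h = tf.keras.layers.Dense(64, kernel_initializer='he_normal')(h)
# Skip connection: add the block's input to its output, then activate
outputs = tf.keras.layers.Activation('relu')(
    tf.keras.layers.Add()([inputs, h]))
model = tf.keras.Model(inputs, outputs)
```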

### 5. GPU Not Detected

**Symptoms:**
- `tf.config.list_physical_devices('GPU')` returns an empty list
- Training runs on CPU (slow)
- CUDA errors on startup

**Diagnostic Steps:**

```python
# Check available devices
print("Physical devices:", tf.config.list_physical_devices())
print("GPU devices:", tf.config.list_physical_devices('GPU'))
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("GPU available:", tf.test.is_gpu_available())  # Deprecated in TF 2.x; prefer list_physical_devices('GPU')

# Check CUDA/cuDNN versions
import subprocess
result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
print(result.stdout)

# Verify the TensorFlow build
import tensorflow as tf
print(tf.__version__)
print(tf.sysconfig.get_build_info())
```

**Common Causes:**
- Wrong TensorFlow package (CPU-only build)
- CUDA/cuDNN version mismatch
- NVIDIA driver issues
- GPU not visible to the container (Docker)

**Solutions:**

```bash
# Install the correct TensorFlow GPU package
pip install tensorflow[and-cuda]   # TF 2.15+
# or
pip install tensorflow-gpu         # Legacy package for older versions

# Verify CUDA compatibility
# TF 2.15: CUDA 12.x, cuDNN 8.9
# TF 2.14: CUDA 11.8, cuDNN 8.7
# TF 2.13: CUDA 11.8, cuDNN 8.6

# For Docker, pass the GPUs through to the container
docker run --gpus all -it tensorflow/tensorflow:latest-gpu
```

```python
# Force GPU visibility
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # Use first GPU

# Verify the GPU is being used
with tf.device('/GPU:0'):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[1.0, 1.0], [0.0, 1.0]])
    c = tf.matmul(a, b)
    print(c.device)  # Should show GPU
```

### 6. SavedModel Loading Errors

**Symptoms:**
- `OSError: SavedModel file does not exist`
- `ValueError: Unknown layer` when loading
- Version compatibility errors

**Diagnostic Steps:**

```python
# Check the SavedModel structure
import os
for root, dirs, files in os.walk('saved_model_dir'):
    for file in files:
        print(os.path.join(root, file))

# Verify the model signature
loaded = tf.saved_model.load('saved_model_dir')
print(list(loaded.signatures.keys()))
```

**Solutions:**

```python
# Save the model correctly
model.save('my_model')        # SavedModel format (TF <= 2.15)
model.save('my_model.keras')  # Keras format (required by Keras 3 / TF 2.16+)

# Load with custom objects
custom_objects = {
    'CustomLayer': CustomLayer,
    'custom_loss': custom_loss
}
model = tf.keras.models.load_model('my_model', custom_objects=custom_objects)

# For version mismatches, save weights only
model.save_weights('model_weights.weights.h5')
# Then rebuild the model architecture and load the weights
new_model.load_weights('model_weights.weights.h5')
```

### 7. Data Pipeline Issues

**Symptoms:**
- `InvalidArgumentError` during training
- Slow training (input bottleneck)
- Memory leaks during data loading

**Diagnostic Steps:**

```python
# Profile the input pipeline
import tensorflow as tf

# Enable the profiler
tf.profiler.experimental.start('/tmp/logdir')
# ... run training ...
tf.profiler.experimental.stop()

# Check the dataset element spec
print(dataset.element_spec)

# Iterate and inspect
for batch in dataset.take(1):
    print(f"Batch shape: {batch[0].shape}")
    print(f"Dtype: {batch[0].dtype}")
```
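
To confirm an input bottleneck without the full profiler, a rough benchmark is to time iteration over the dataset alone; if that approaches your per-epoch training time, the pipeline is the bottleneck. A small sketch (the helper name `benchmark` and `num_batches` value are illustrative):

```python
import time
import tensorflow as tf

def benchmark(dataset, num_batches=100):
    # Iterate the dataset with no model work, and time it
    start = time.perf_counter()
    for _ in dataset.take(num_batches):
        pass
    elapsed = time.perf_counter() - start
    print(f"{num_batches} batches in {elapsed:.2f}s "
          f"({num_batches / elapsed:.1f} batches/s)")
```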

**Solutions:**

```python
# Optimize the pipeline
dataset = tf.data.Dataset.from_tensor_slices((x, y))
dataset = dataset.cache()                     # Cache after expensive operations
dataset = dataset.shuffle(buffer_size=1000)
dataset = dataset.batch(32)
dataset = dataset.prefetch(tf.data.AUTOTUNE)  # Overlap data loading with training

# Use parallel processing
dataset = dataset.map(
    preprocess_fn,
    num_parallel_calls=tf.data.AUTOTUNE
)

# Handle variable-length sequences
dataset = dataset.padded_batch(32, padded_shapes=([None], []))
```

## Debugging Tools

### tf.debugging Module

```python
# Shape assertions
tf.debugging.assert_shapes([
    (x, ('N', 'H', 'W', 'C')),
    (y, ('N', 'num_classes'))
])

# Value assertions
tf.debugging.assert_non_negative(x)
tf.debugging.assert_near(x, y, rtol=1e-5)
tf.debugging.assert_equal(x.shape, expected_shape)

# Numeric checking
tf.debugging.check_numerics(tensor, "check: tensor contains NaN/Inf")
tf.debugging.enable_check_numerics()  # Global check

# Type assertions
tf.debugging.assert_type(x, tf.float32)
```

### TensorBoard

```python
import datetime

# Set up TensorBoard logging
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=log_dir,
    histogram_freq=1,
    profile_batch='500,520'  # Profile batches 500-520
)

model.fit(
    x_train, y_train,
    epochs=5,
    callbacks=[tensorboard_callback]
)

# Launch TensorBoard
# tensorboard --logdir logs/fit
```

### TensorBoard Debugger V2

```python
# Enable debug info dumping (first argument is the dump root directory)
tf.debugging.experimental.enable_dump_debug_info(
    '/tmp/tfdbg2_logdir',
    tensor_debug_mode='FULL_HEALTH',
    circular_buffer_size=1000
)

# Run training...
model.fit(x_train, y_train, epochs=5)

# View in TensorBoard
# tensorboard --logdir /tmp/tfdbg2_logdir
```

### Eager Execution Debugging

```python
# Run tf.function code eagerly (eager execution is the TF 2.x default,
# but @tf.function still traces graphs)
tf.config.run_functions_eagerly(True)

# Debug inside a @tf.function
@tf.function
def my_function(x):
    tf.print("Debug:", x)  # Works in graph mode
    # Use tf.debugging.assert_* for runtime checks
    tf.debugging.assert_positive(x)
    return x * 2

# Disable tf.function for debugging
@tf.function
def buggy_function(x):
    # Temporarily remove the @tf.function decorator,
    # or call tf.config.run_functions_eagerly(True)
    return x
```

### tf.print() for Graph Mode

```python
@tf.function
def compute(x):
    # Regular print() only runs once, at trace time, in graph mode
    tf.print("Shape:", tf.shape(x))
    tf.print("Values:", x, summarize=-1)  # -1 prints all values
    tf.print("Stats - min:", tf.reduce_min(x),
             "max:", tf.reduce_max(x),
             "mean:", tf.reduce_mean(x))
    return x * 2
```

### Memory Profiler

```python
# Allow memory usage to grow as needed on each GPU
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

# Use the TensorFlow Profiler
with tf.profiler.experimental.Profile('/tmp/logdir'):
    model.fit(x_train, y_train, epochs=1)

# Check memory info
tf.config.experimental.get_memory_info('GPU:0')
# Returns: {'current': bytes, 'peak': bytes}
```
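
To watch memory across a whole run, a small callback can log the peak each epoch and reset the counter. A sketch, assuming a single GPU exposed as `'GPU:0'` (the class name `MemoryLogger` is illustrative):

```python
import tensorflow as tf

class MemoryLogger(tf.keras.callbacks.Callback):
    """Log peak GPU memory per epoch (sketch; single GPU assumed)."""

    def on_epoch_end(self, epoch, logs=None):
        info = tf.config.experimental.get_memory_info('GPU:0')
        print(f"Epoch {epoch}: peak={info['peak'] / 2**20:.0f} MiB, "
              f"current={info['current'] / 2**20:.0f} MiB")
        # Reset the peak so each epoch is measured independently
        tf.config.experimental.reset_memory_stats('GPU:0')
```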

## The Four Phases of TensorFlow Debugging

### Phase 1: Reproduce and Isolate

1. **Create a minimal reproduction**

   ```python
   # Minimal test case
   import tensorflow as tf

   # Smallest possible model
   model = tf.keras.Sequential([
       tf.keras.layers.Dense(10, input_shape=(5,))
   ])

   # Synthetic data
   x = tf.random.normal((32, 5))
   y = tf.random.normal((32, 10))

   model.compile(optimizer='adam', loss='mse')
   model.fit(x, y, epochs=1)
   ```

2. **Enable eager execution for line-by-line debugging**

   ```python
   tf.config.run_functions_eagerly(True)
   ```

3. **Add assertions at key points**

   ```python
   def debug_forward_pass(model, x):
       for i, layer in enumerate(model.layers):
           x = layer(x)
           tf.debugging.check_numerics(x, f"Layer {i} output")
           print(f"Layer {i}: {x.shape}, "
                 f"range=[{tf.reduce_min(x):.3f}, {tf.reduce_max(x):.3f}]")
       return x
   ```

### Phase 2: Analyze and Understand

1. **Inspect tensor shapes throughout the pipeline**

   ```python
   def trace_shapes(model, x):
       shapes = []
       for layer in model.layers:
           x = layer(x)
           shapes.append((layer.name, x.shape))
       return shapes
   ```

2. **Check gradient flow**

   ```python
   def analyze_gradients(model, x, y, loss_fn):
       with tf.GradientTape() as tape:
           pred = model(x, training=True)
           loss = loss_fn(y, pred)
       grads = tape.gradient(loss, model.trainable_variables)

       analysis = []
       for var, grad in zip(model.trainable_variables, grads):
           if grad is None:
               analysis.append((var.name, "NONE - disconnected"))
           else:
               norm = tf.norm(grad).numpy()
               analysis.append((var.name, f"norm={norm:.6f}"))
       return analysis
   ```

3. **Profile performance**

   ```python
   # Use tf.profiler
   tf.profiler.experimental.start('/tmp/logdir')
   model.fit(x, y, epochs=1)
   tf.profiler.experimental.stop()
   ```

### Phase 3: Fix and Verify

1. **Apply targeted fixes based on the diagnosis**
   - Shape issues: add explicit reshapes and assertions
   - NaN issues: add epsilon, reduce the learning rate, clip gradients
   - OOM issues: reduce the batch size, enable memory growth
   - GPU issues: check CUDA compatibility, install the correct packages

2. **Verify the fix doesn't break other functionality**

   ```python
   # Run comprehensive tests
   def test_model_components():
       # Test forward pass
       output = model(sample_input)
       assert output.shape == expected_shape

       # Test backward pass
       with tf.GradientTape() as tape:
           loss = loss_fn(model(x), y)
       grads = tape.gradient(loss, model.trainable_variables)
       assert all(g is not None for g in grads)

       # Test save/load round trip (use approximate float comparison)
       model.save('/tmp/test_model.keras')
       loaded = tf.keras.models.load_model('/tmp/test_model.keras')
       tf.debugging.assert_near(model(x), loaded(x))
   ```

### Phase 4: Prevent and Document

1. **Add permanent assertions for critical invariants**

   ```python
   class RobustModel(tf.keras.Model):
       def call(self, x, training=False):
           tf.debugging.assert_shapes([(x, ('batch', 'features'))])
           x = self.layer1(x)
           tf.debugging.check_numerics(x, "After layer1")
           return self.output_layer(x)
   ```

2. **Set up monitoring callbacks** (a built-in alternative follows below)

   ```python
   import math

   class NanCallback(tf.keras.callbacks.Callback):
       def on_batch_end(self, batch, logs=None):
           # logs['loss'] is a plain Python/NumPy float here
           loss = (logs or {}).get('loss')
           if loss is not None and not math.isfinite(float(loss)):
               self.model.stop_training = True
               raise ValueError(f"NaN/Inf loss detected at batch {batch}")
   ```
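
   Keras also ships a built-in callback for this specific check, which stops training when the loss becomes NaN:

   ```python
   # Built-in equivalent: terminate training on a NaN loss
   model.fit(x_train, y_train,
             callbacks=[tf.keras.callbacks.TerminateOnNaN()])
   ```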

3. **Document the issue and solution**

   ```python
   # BUGFIX: Shape mismatch in attention layer
   # Issue: Input was (batch, seq, features) but attention expected
   #        (batch, heads, seq, features)
   # Solution: Added reshape before the attention layer
   x = tf.reshape(x, [batch_size, num_heads, seq_len, -1])
   ```

## Quick Reference Commands

### Device and Configuration

```python
# List devices
tf.config.list_physical_devices()
tf.config.list_physical_devices('GPU')

# GPU memory growth
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

# Force CPU execution
with tf.device('/CPU:0'):
    result = model(x)

# Check if built with CUDA
tf.test.is_built_with_cuda()
```

### Debugging Assertions

```python
# Numeric checks
tf.debugging.check_numerics(tensor, message)
tf.debugging.enable_check_numerics()

# Shape checks
tf.debugging.assert_shapes([(tensor, shape_tuple)])
tf.ensure_shape(tensor, shape)

# Value checks
tf.debugging.assert_positive(tensor)
tf.debugging.assert_non_negative(tensor)
tf.debugging.assert_near(a, b, rtol=1e-5)
tf.debugging.assert_equal(a, b)
tf.debugging.assert_less(a, b)
tf.debugging.assert_greater(a, b)
```

### Profiling and Logging

```python
# TensorBoard logging
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir='./logs',
    histogram_freq=1
)

# Start the profiler
tf.profiler.experimental.start('/tmp/logdir')
# ... code ...
tf.profiler.experimental.stop()

# Debug info for TensorBoard Debugger V2
tf.debugging.experimental.enable_dump_debug_info(
    '/tmp/tfdbg2',
    tensor_debug_mode='FULL_HEALTH'
)
```

### Memory Management

```python
# Clear session
tf.keras.backend.clear_session()

# Get memory info
tf.config.experimental.get_memory_info('GPU:0')

# Mixed precision
tf.keras.mixed_precision.set_global_policy('mixed_float16')
```

### Gradient Debugging

```python
# Inspect gradients
with tf.GradientTape() as tape:
    loss = compute_loss()
gradients = tape.gradient(loss, model.trainable_variables)

# Clip gradients before applying them
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))

# Check for None gradients (disconnected graph)
for var, grad in zip(model.trainable_variables, gradients):
    if grad is None:
        print(f"Warning: {var.name} has no gradient")
```

## Version Compatibility Reference

| TensorFlow | Python   | CUDA | cuDNN |
|------------|----------|------|-------|
| 2.16.x     | 3.9-3.12 | 12.3 | 8.9   |
| 2.15.x     | 3.9-3.11 | 12.2 | 8.9   |
| 2.14.x     | 3.9-3.11 | 11.8 | 8.7   |
| 2.13.x     | 3.8-3.11 | 11.8 | 8.6   |
| 2.12.x     | 3.8-3.11 | 11.8 | 8.6   |

## Additional Resources

- [TensorFlow Debugging Guide](https://www.tensorflow.org/guide/effective_tf2#debugging)
- [TensorBoard Debugger V2](https://www.tensorflow.org/tensorboard/debugger_v2)
- [GPU Performance Analysis](https://www.tensorflow.org/guide/gpu_performance_analysis)
- [Profiler Guide](https://www.tensorflow.org/guide/profiler)