---
name: gpu-parallel-scheduling
description: "GPU-safe parallel processing patterns for KINTSUGI to prevent OOM crashes and ensure Jupyter-compatible progress output"
author: KINTSUGI Team
date: 2025-12-15
---

# GPU Parallel Scheduling - Research Notes

## Experiment Overview

| Item | Details |
|------|---------|
| **Date** | 2025-12-15 |
| **Goal** | Fix kernel crashes during parallel GPU processing in Notebook 2 |
| **Environment** | KINTSUGI pipeline, CuPy, multi-GPU HPC, Jupyter notebooks |
| **Status** | Success |

## Context

Notebook 2 (Cycle Processing) was crashing after Channel 1 completed, at the point where it attempted to process the remaining channels in parallel. The crash occurred immediately after the message "[PARALLEL] Processing channels [2, 3, 4] across GPUs..." with no further output.

## Root Cause Analysis

### Problem 1: Nested ThreadPoolExecutor Causing GPU Thread Explosion

The code structure was:

```python
# OUTER: Parallel channel processing
with ThreadPoolExecutor(max_workers=len(GPU_DEVICE_IDS)) as executor:
    for ch in channels:
        # INNER: Each channel spawns parallel z-plane workers
        process_channel_zplanes_parallel(ch, ...)

# Inside process_channel_zplanes_parallel:
with ThreadPoolExecutor(max_workers=ZPLANES_PER_GPU) as inner_executor:
    ...  # Process z-planes in parallel
```

With 2 GPUs, 3 channels, and `ZPLANES_PER_GPU=4`:

- Channel assignment: CH2→GPU0, CH3→GPU1, CH4→GPU0 (round-robin)
- GPU0 gets CH2 + CH4 simultaneously
- Each channel spawns 4 z-plane workers
- GPU0 runs 2 × 4 = 8 concurrent BaSiC corrections → **OOM CRASH**

### Problem 2: Jupyter Thread Output Suppression

Print statements from worker threads don't appear in Jupyter notebook output until the cell completes, or never appear at all. This made debugging difficult, because processing appeared to stall.

### Problem 3: Missing GPU Memory Cleanup Between Channels

GPU memory from Channel 1 wasn't being freed before Channel 2 started, causing cumulative memory pressure.

## Verified Workflow

### Solution: Queue-Based GPU Allocation

Use a GPU queue to ensure **exactly 1 channel per GPU** at any time:

```python
import gc
import queue
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# GPU_DEVICE_IDS, ZPLANES_PER_GPU, remaining_channels, log, and
# process_channel_zplanes_parallel are provided by the pipeline setup.

# Create pool of available GPUs
gpu_queue = queue.Queue()
for dev_id in GPU_DEVICE_IDS:
    gpu_queue.put(dev_id)

def process_channel_with_gpu(ch):
    """Process channel, acquiring and releasing GPU from queue."""
    # ACQUIRE: Block until a GPU is available
    dev_id = gpu_queue.get()
    ch_start = time.time()
    try:
        process_channel_zplanes_parallel(
            ..., device_id=dev_id,
            zplanes_per_gpu=ZPLANES_PER_GPU, ...
        )
        elapsed = time.time() - ch_start

        # GPU cleanup before releasing
        try:
            import cupy as cp
            with cp.cuda.Device(dev_id):
                cp.get_default_memory_pool().free_all_blocks()
                cp.get_default_pinned_memory_pool().free_all_blocks()
        except Exception:
            pass
        gc.collect()

        return (ch, dev_id, elapsed, None)
    except Exception as e:
        return (ch, dev_id, time.time() - ch_start, str(e))
    finally:
        # RELEASE: Return GPU to pool for next channel
        gpu_queue.put(dev_id)

# Process with max_workers = number of GPUs
with ThreadPoolExecutor(max_workers=len(GPU_DEVICE_IDS)) as executor:
    futures = {executor.submit(process_channel_with_gpu, ch): ch
               for ch in remaining_channels}

    # Main thread prints progress (Jupyter-compatible)
    for future in as_completed(futures):
        ch, dev_id, elapsed, error = future.result()
        if error:
            log(f"  [GPU{dev_id}] Channel {ch} ERROR: {error}")
        else:
            log(f"  [GPU{dev_id}] Channel {ch} COMPLETE ({elapsed:.1f}s)")
```
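To see the scheduling behavior in isolation, here is a minimal runnable sketch of the same queue pattern with the GPU work replaced by `time.sleep`. The device IDs, channel list, and `fake_process_channel` are made-up stand-ins for this demo; no CuPy is required:

```python
import queue
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

FAKE_GPU_IDS = [0, 1]   # stand-in for GPU_DEVICE_IDS
CHANNELS = [2, 3, 4]    # stand-in for remaining_channels

gpu_queue = queue.Queue()
for dev_id in FAKE_GPU_IDS:
    gpu_queue.put(dev_id)

def fake_process_channel(ch):
    """Acquire a 'GPU', pretend to work on it, always release it."""
    dev_id = gpu_queue.get()        # blocks until a device is free
    start = time.time()
    try:
        time.sleep(0.5)             # placeholder for the real GPU work
        return ch, dev_id, time.time() - start
    finally:
        gpu_queue.put(dev_id)       # release even if the work raised

with ThreadPoolExecutor(max_workers=len(FAKE_GPU_IDS)) as executor:
    futures = {executor.submit(fake_process_channel, ch): ch for ch in CHANNELS}
    for future in as_completed(futures):
        ch, dev_id, elapsed = future.result()
        print(f"[GPU{dev_id}] Channel {ch} done in {elapsed:.2f}s")
```

With two fake GPUs and three channels, the third channel starts only after one of the first two finishes. A `threading.Semaphore(n_gpus)` would enforce the same concurrency limit, but the queue also tells each worker *which* device it holds, which the per-channel CuPy cleanup step needs.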
### Key Principles

1. **max_workers = n_gpus**: Never run more concurrent channels than GPUs
2. **Queue acquisition**: Each channel blocks until it gets a GPU
3. **Queue release in `finally`**: The GPU always returns to the pool, even on error
4. **GPU cleanup before release**: Free memory before another channel uses the GPU
5. **Main-thread progress**: Use `as_completed()` to print from the main thread

## Failed Attempts (Critical)

| Attempt | Why It Failed | Lesson Learned |
|---------|---------------|----------------|
| Parallel channels with `iter_cycle(GPU_DEVICE_IDS)` | Pre-assigned GPUs didn't prevent multiple channels on the same GPU | Need dynamic GPU allocation, not static assignment |
| `max_workers=len(GPU_DEVICE_IDS)` without queue | Channels could start on any GPU regardless of assignment | Queue ensures 1:1 GPU:channel mapping |
| Printing progress from worker threads | Output lost in Jupyter (thread stdout not captured) | Always print from the main thread using `as_completed()` |
| GPU cleanup only at end of cell | Memory exhausted before the cell completed | Clean up after EACH channel, before releasing the GPU |
| Nested ThreadPoolExecutor | Thread explosion: outer × inner workers | Inner parallelism is OK; the outer level must be limited to n_gpus |
| Adding a status monitoring thread | Monitoring-thread output also lost in Jupyter | Only main-thread output is reliable in notebooks |

## Final Parameters

### GPU Queue Pattern

```python
# Initialize
gpu_queue = queue.Queue()
for dev_id in GPU_DEVICE_IDS:
    gpu_queue.put(dev_id)

# In worker
dev_id = gpu_queue.get()   # Blocks until available
try:
    ...                    # GPU work
finally:
    gpu_queue.put(dev_id)  # Always return
```

### Progress Output Pattern (Jupyter-safe)

```python
with ThreadPoolExecutor(max_workers=n_gpus) as executor:
    futures = {executor.submit(work_fn, item): item for item in items}
    for future in as_completed(futures):
        result = future.result()
        print(f"Completed: {result}")  # Main thread - visible in Jupyter
```

### GPU Cleanup Pattern

```python
try:
    import cupy as cp
    with cp.cuda.Device(device_id):
        cp.get_default_memory_pool().free_all_blocks()
        cp.get_default_pinned_memory_pool().free_all_blocks()
except Exception:
    pass
gc.collect()
```

## Key Insights

- **CuPy is not thread-safe** for concurrent operations on the same GPU
- **Jupyter suppresses thread output** - always use the main thread for progress
- **Queue > static assignment** for dynamic GPU allocation
- **Cleanup before release** prevents memory accumulation
- **Inner parallelism is safe** (z-planes on a single GPU) when the outer level is controlled
- **1 channel per GPU** rule prevents the OOM crashes caused by parallel processing

## When to Apply This Pattern

- Multi-GPU parallel processing in Jupyter notebooks
- Any CuPy-based batch processing with parallelism
- When the kernel crashes after the first item completes in a parallel loop
- When progress output disappears during parallel processing
- When OOM occurs despite enough total GPU memory

## References

- CuPy Memory Management: https://docs.cupy.dev/en/stable/user_guide/memory.html
- Python concurrent.futures: https://docs.python.org/3/library/concurrent.futures.html
- KINTSUGI Notebook 2: Cycle Processing
- Related skill: `gpu-memory-cleanup` (per-channel cleanup pattern)
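As a final sanity check, the 1-channel-per-GPU invariant can be demonstrated without any hardware. This toy script (all names hypothetical, standard library only) runs a deliberately oversized thread pool against a two-slot device queue and confirms that the peak number of simultaneous holders never exceeds the queue size:

```python
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor

N_GPUS = 2                       # hypothetical device count
gpu_queue = queue.Queue()
for dev_id in range(N_GPUS):
    gpu_queue.put(dev_id)

lock = threading.Lock()
active = 0                       # devices currently held
peak = 0                         # highest value `active` ever reached

def worker(item):
    global active, peak
    dev_id = gpu_queue.get()     # blocks until a device is free
    try:
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.1)          # placeholder for GPU work
    finally:
        with lock:
            active -= 1
        gpu_queue.put(dev_id)

# Oversized pool on purpose: 8 workers, only 2 devices
with ThreadPoolExecutor(max_workers=8) as executor:
    list(executor.map(worker, range(10)))

assert peak <= N_GPUS
print(f"peak concurrent device holders: {peak}")  # never exceeds N_GPUS
```

This bound is exactly what static round-robin assignment failed to provide in the first failed attempt above.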