---
description: "How NeMo Curator automatically balances resources across pipeline stages"
categories: ["architecture"]
tags: ["deep-dive", "auto-balancing", "scheduling", "performance"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "concept"
modality: "universal"
---

# Auto-Balancing Heterogeneous Models

NeMo Curator auto-balances resources at the application level across pipeline stages to maximize throughput. This means you can focus on defining your curation logic rather than manually tuning parallelism.

## The Problem: Unbalanced Pipelines

In a typical curation pipeline, stages have very different processing speeds. Consider a video pipeline with a fast stage, a slow stage, and a medium stage — all sharing a fixed GPU budget:

**Without auto-balancing (3 GPUs, 1 worker per stage):**
- The **fast stage** emits 4 tasks/s, but the **slow stage** can only handle 1 task/s. Tasks back up in the queue between them at 3 tasks/s, eventually causing memory pressure.
- The **medium stage** could process 2 tasks/s, but it only receives what the slow stage upstream emits, so it is starved for work. In practice only ~1 task/s is realized.
- End-to-end throughput is capped at ~1 task/s, the rate of the slowest stage, even though the fast stage produces 4 tasks/s.

**With auto-balancing (7 GPUs, scaled workers):**
- The executor detects the bottleneck and scales the **slow stage to 4× workers** and the **medium stage to 2× workers**.
- Now every stage sustains **4 tasks/s** throughput. Queues stay relatively clear — new jobs are picked up promptly.
- **Result: 4× throughput improvement** over the one-worker-per-stage layout, with the 7 GPUs split 1:4:2 across the fast, slow, and medium stages so that no stage idles or falls behind.
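
The arithmetic behind both scenarios can be checked with a short sketch. It assumes a simple steady-state model in which a stage's capacity is its per-worker rate times its worker count and the pipeline runs at the capacity of its slowest stage. The function name `pipeline_throughput` is illustrative and not part of NeMo Curator.

```python
def pipeline_throughput(rates, workers):
    """Steady-state throughput (tasks/s) under a simple model:
    the pipeline runs at the capacity of its slowest stage."""
    return min(rates[stage] * workers[stage] for stage in rates)


# Per-worker rates from the example above (tasks/s).
rates = {"fast": 4, "slow": 1, "medium": 2}

# One worker per stage (3 GPUs) versus the scaled layout (7 GPUs).
unbalanced = {"fast": 1, "slow": 1, "medium": 1}
balanced = {"fast": 1, "slow": 4, "medium": 2}

# Queue growth between the fast and slow stages in the unbalanced case.
backlog_growth = rates["fast"] * unbalanced["fast"] - rates["slow"] * unbalanced["slow"]

print(pipeline_throughput(rates, unbalanced))  # 1
print(pipeline_throughput(rates, balanced))    # 4
print(backlog_growth)                          # 3
```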

## How Auto-Balancing Works

The executor monitors the throughput and queue depth of each stage at runtime and uses this information to:

1. **Rebalance resources.** At regular intervals, the executor shifts GPU/CPU allocations toward bottleneck stages.
2. **Apply backpressure.** When a downstream stage can't keep up, upstream stages slow their output rate rather than buffering unbounded data in memory. This reduces memory pressure and prevents spilling to disk.
3. **Scale workers dynamically.** If a stage is falling behind, the executor allocates additional workers to that stage (within the available resource budget).
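
To make the scaling step concrete, here is a minimal sketch of one possible allocation strategy: a greedy loop that hands each spare GPU to whichever stage currently has the lowest capacity. This is an illustration only, not the executor's actual algorithm. The function `rebalance`, the one-GPU-per-worker assumption, and the use of fixed per-worker rates (rather than rates measured at runtime) are all simplifications introduced here.

```python
def rebalance(rates, gpu_budget):
    """Greedy allocation sketch (hypothetical, not NeMo Curator's API).

    rates: per-worker throughput of each stage in tasks/s.
    gpu_budget: total GPUs, assuming each worker uses exactly one GPU.
    """
    if gpu_budget < len(rates):
        raise ValueError("need at least one GPU per stage")

    # Every stage needs at least one worker to make progress.
    workers = {stage: 1 for stage in rates}

    # Give each remaining GPU to the current bottleneck stage.
    for _ in range(gpu_budget - len(rates)):
        bottleneck = min(rates, key=lambda s: rates[s] * workers[s])
        workers[bottleneck] += 1
    return workers


rates = {"fast": 4, "slow": 1, "medium": 2}
allocation = rebalance(rates, gpu_budget=7)
print(allocation)  # {'fast': 1, 'slow': 4, 'medium': 2}
```

With the rates from the earlier example and a 7-GPU budget, the greedy loop arrives at the same 1:4:2 split. A real executor would repeat this kind of decision periodically using observed throughput and queue depth.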

## What This Means for You

- **No manual parallelism tuning.** You don't need to calculate the optimal number of workers per stage — the executor adapts at runtime.
- **Predictable memory usage.** Backpressure prevents unbounded buffering, so memory usage stays stable even with unbalanced stages.
- **Efficient hardware utilization.** Resources shift toward the current bottleneck instead of being statically allocated.
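
The memory claim can be illustrated with a bounded queue from the Python standard library. This is a sketch of the general backpressure idea, assuming a single producer and a single slower consumer running as threads. It does not show how the executor implements backpressure internally, and `run_pipeline` is a hypothetical helper.

```python
import queue
import threading
import time


def run_pipeline(num_tasks, max_queue_size, consumer_delay_s=0.001):
    """Run a fast producer against a slow consumer through a bounded queue.

    Returns the peak queue depth seen by the producer and the processed tasks.
    """
    buffer = queue.Queue(maxsize=max_queue_size)
    peak_depth = 0
    processed = []

    def producer():
        nonlocal peak_depth
        for task_id in range(num_tasks):
            # put() blocks while the queue is full: this is the backpressure.
            buffer.put(task_id)
            peak_depth = max(peak_depth, buffer.qsize())
        buffer.put(None)  # sentinel: no more tasks

    def consumer():
        while True:
            task = buffer.get()
            if task is None:
                break
            time.sleep(consumer_delay_s)  # simulate a slow stage
            processed.append(task)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return peak_depth, processed


peak_depth, processed = run_pipeline(num_tasks=50, max_queue_size=4)
print(peak_depth <= 4)   # True
print(len(processed))    # 50
```

Because the producer blocks instead of buffering, the queue never holds more than `max_queue_size` tasks no matter how much faster the producer is, which is why memory usage stays bounded.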

## Monitoring Stage Balance

Use the Ray Dashboard to monitor how the executor is balancing your pipeline. If you notice a persistent bottleneck that auto-balancing can't resolve (for example, a stage that needs more GPU memory than is available), consider splitting the pipeline or scaling your cluster.