--- description: "How NeMo Curator's streaming architecture processes data in constant memory" categories: ["architecture"] tags: ["deep-dive", "streaming", "batching", "memory", "data-flow"] personas: ["data-scientist-focused", "mle-focused"] difficulty: "intermediate" content_type: "concept" modality: "universal" --- # Streaming vs. Batch Processing Different inference stages have different compute requirements. NeMo Curator uses Ray streaming to increase GPU utilization and processing speed compared to traditional batch-all-at-once approaches. ## Batch Mode vs. Streaming Mode ### Batch Mode In batch mode, each stage processes the **entire dataset** before the next stage begins. Stages with different compute requirements (CPU-only tokenization, single-GPU classifiers, multi-GPU encoders) all run sequentially: | Aspect | Batch Mode | |--------|-----------| | **Execution** | One stage at a time across the full dataset | | **Memory usage** | Proportional to dataset size | | **GPU utilization** | Low — GPUs idle while CPU stages run, and vice versa | | **Time to first output** | After the entire pipeline finishes | ### Streaming Mode In streaming mode, data flows through the pipeline as discrete batches. Each stage processes its current batch and immediately passes it downstream, so **all stages run concurrently** on different batches: | Aspect | Streaming Mode | |--------|---------------| | **Execution** | All stages active simultaneously on different batches | | **Memory usage** | Constant (proportional to batch size, not dataset size) | | **GPU utilization** | High — stages with different hardware needs overlap | | **Time to first output** | After the first batch completes the pipeline | ## Why Streaming Is Faster Streaming with heterogeneous compute allows NeMo Curator to overlap stages that use different resources. For example, while a GPU inference stage processes batch N, a CPU tokenization stage can process batch N+1 simultaneously — neither blocks the other. This overlap improves throughput in pipelines that mix CPU and GPU work, because both happen in parallel rather than taking turns. Combined with auto-balancing, streaming enables Curator to rearrange resources so that GPU stage workers are **kept busy over 99% of the time** after an initial warm-up period. ## Heterogeneous Executors NeMo Curator supports streaming with multiple executors — Cosmos Xenna, Ray Data, and others — each optimized for different workload patterns. The executor handles scheduling, backpressure, and resource allocation so that streaming "just works" regardless of how many stages your pipeline has. ## Configuring Batch Size Batch size controls the trade-off between memory usage and throughput: ```python from nemo_curator.stages.resources import Resources # Configure batch size on a stage word_count_stage = WordCountStage().with_(batch_size=128) ``` - **Smaller batches**: Lower memory usage per batch. Ray may handle smaller batches more efficiently in some workloads. - **Larger batches**: More in-memory data per batch, which can reduce I/O overhead but uses more memory. ## When Streaming Matters Most - **Datasets exceed memory.** A 10 TB Common Crawl snapshot cannot fit in RAM, but it can be streamed in manageable chunks. - **Pipeline stages have different hardware needs.** CPU-only and GPU stages overlap instead of taking turns. - **You need early results.** Inspect output from the first batch while the rest of the dataset is still processing.