--- description: "Understanding data flow in video curation pipelines including Ray object store and streaming optimization" categories: ["concepts-architecture"] tags: ["data-flow", "distributed", "ray", "streaming", "performance", "video-curation"] personas: ["data-scientist-focused", "mle-focused"] difficulty: "intermediate" content_type: "concept" modality: "video-only" --- # Data Flow Understanding how data moves through NeMo Curator's video curation pipelines is key to optimizing performance and resource usage. - Data moves between stages via Ray's distributed object store, enabling efficient, in-memory transfer between distributed actors. - In streaming mode (where stages operate continuously rather than in batches), the executor returns only final-stage outputs while keeping intermediate state in memory. This reduces I/O overhead and significantly improves throughput. - The auto-scaling component continuously balances resources to maximize pipeline throughput, dynamically allocating workers to stages as needed. - Writer stages persist outputs at the end of the pipeline, including clip media, embeddings (pickle and parquet variants), and metadata JSON files. Together, these components enable efficient processing of large-scale video datasets with minimal data movement and optimal use of available hardware. ## Writer Output Layout Writer stages produce the following directories under the configured output path: - `clips/`: MP4 clip files - `filtered_clips/`: MP4 files for filtered clips - `previews/`: WebP preview images for windows - `metas/v0/`: Per-clip JSON metadata files - `ce1_embd/`: Per-clip embeddings (pickle) for Cosmos-Embed1 - `ce1_embd_parquet/`: Aggregated per-video embeddings (parquet) for Cosmos-Embed1 - `processed_videos/`: Per-video JSON metadata files - `processed_clip_chunks/`: Per-clip-chunk JSON statistics