---
description: "Load video data into NeMo Curator from local paths or fsspec-supported storage, including explicit file list support"
categories: ["video-curation"]
tags: ["video", "load", "s3", "local", "file-list"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "beginner"
content_type: "howto"
modality: "video-only"
---

# Video Data Loading

Load video data for curation using NeMo Curator.

## How it Works

NeMo Curator loads videos with `VideoReader`, a composite stage that discovers files and extracts metadata. It breaks down into two stages:

1. Partitioning (list files) stage
   - Local paths use `FilePartitioningStage` to list files.
   - Remote URLs (for example, `s3://`, `gcs://`) use `ClientPartitioningStage` backed by `fsspec`.
   - Optional `input_list_json_path` allows explicit file lists under a root prefix.
2. Reader stage (`VideoReaderStage`)
   - Downloads the bytes (local or via `FSPath`) for each listed file.
   - Calls `video.populate_metadata()` to extract resolution, fps, duration, encoding format, and other fields.

You can set:

- `video_limit` to limit the number of files to be processed; use `None` for unlimited.
- `verbose=True` to log detailed per-video information.

---

## Local and Cloud

Use `VideoReader` to load videos from local paths or remote URLs.

### Example

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.video.io.video_reader import VideoReader

pipe = Pipeline(name="video_read", description="Read videos and extract metadata")

# video_limit=None processes every discovered file; verbose=True logs per-video details.
pipe.add_stage(
    VideoReader(
        input_video_path="s3://my-bucket/videos/",
        video_limit=None,
        verbose=True,
    )
)
pipe.run()
```

## Explicit File List (JSON)

For remote datasets, `ClientPartitioningStage` can use an explicit file list JSON. Each entry must be an absolute path under the specified root.
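Before handing a list to the pipeline, it can help to check it locally against the root. The sketch below is a plain string-prefix check, not NeMo Curator's own validation; the helper name `find_entries_outside_root` and the sample paths are hypothetical and used only for illustration.

```python
import json


def find_entries_outside_root(entries: list[str], root: str) -> list[str]:
    """Return the entries that do not sit under `root` (hypothetical helper)."""
    # Normalize the root so "s3://bucket/data" does not match "s3://bucket/data_old/...".
    prefix = root if root.endswith("/") else root + "/"
    return [entry for entry in entries if not entry.startswith(prefix)]


# Illustrative list contents, as they would appear in a JSON file list.
raw = '["s3://my-bucket/datasets/videos/video1.mp4", "s3://my-bucket/other/video2.mp4"]'
entries = json.loads(raw)

outside = find_entries_outside_root(entries, "s3://my-bucket/datasets/")
print(outside)  # entries that would fall outside the root
```

Running a check like this first reports every offending entry at once, before any pipeline stage runs.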

### JSON Format

```json
[
  "s3://my-bucket/datasets/videos/video1.mp4",
  "s3://my-bucket/datasets/videos/video2.mkv",
  "s3://my-bucket/datasets/more_videos/video3.webm"
]
```

If any entry is outside the root, the stage raises an error.

### Example

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.client_partitioning import ClientPartitioningStage
from nemo_curator.stages.video.io.video_reader import VideoReaderStage

ROOT = "s3://my-bucket/datasets/"
JSON_LIST = "s3://my-bucket/lists/videos.json"

pipe = Pipeline(name="video_read_json_list", description="Read specific videos via JSON list")

# List only the files named in JSON_LIST; every entry must sit under ROOT.
pipe.add_stage(
    ClientPartitioningStage(
        file_paths=ROOT,
        input_list_json_path=JSON_LIST,
        files_per_partition=1,
        file_extensions=[".mp4", ".mov", ".avi", ".mkv", ".webm"],
    )
)

# Download each listed file and extract its metadata.
pipe.add_stage(VideoReaderStage(verbose=True))
pipe.run()
```

## Supported File Types

The loader filters these video extensions by default:

- `.mp4`
- `.mov`
- `.avi`
- `.mkv`
- `.webm`

## Metadata on Load

After a successful read, the loader populates the following metadata fields for each video:

- `size` (bytes)
- `width`, `height`
- `framerate`
- `num_frames`
- `duration` (seconds)
- `video_codec`, `pixel_format`, `audio_codec`
- `bit_rate_k`

With `verbose=True`, the loader logs size, resolution, fps, duration, weight, and bit rate for each processed video.

{/* end */}