---
description: "Process video data by splitting into clips, encoding, generating embeddings and captions, and removing duplicates"
categories: ["video-curation"]
tags: ["splitting", "encoding", "embeddings", "captioning", "deduplication"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "intermediate"
content_type: "workflow"
modality: "video-only"
---

# Process Data

Use NeMo Curator stages to split videos into clips, encode them, generate embeddings or captions, and remove duplicates.

## How it Works

Create a `Pipeline` and add stages for clip extraction, optional re-encoding and filtering, embeddings or captions, previews, and writing outputs. Each stage is modular and configurable to match your quality and performance needs.

## Processing Options

Choose from the following stages to split, encode, filter, embed, caption, preview, and remove duplicates in your videos:

- Split long videos into shorter clips using fixed stride or scene-change detection. Tags: `clips`, `fixed-stride`, `transnetv2`
- Encode clips to H.264 using CPU or GPU encoders and tune performance. Tags: `clips`, `h264_nvenc`
- Apply motion-based filtering and aesthetic filtering to improve dataset quality. Tags: `clips`, `frames`, `motion`, `aesthetic`
- Extract frames from clips or full videos for embeddings, filtering, and analysis. Tags: `frames`, `fps`
- Generate clip-level embeddings with Cosmos-Embed1 for search and duplicate removal. Tags: `clips`, `cosmos-embed1`
- Produce clip captions and optional preview images for review workflows. Tags: `clips`, `frames`, `captions`, `preview`
- Remove near-duplicates using semantic clustering and similarity with generated embeddings. Tags: `clips`, `semantic`, `pairwise`

## Write Outputs

Persist clips, embeddings, previews, and metadata at the end of the pipeline using `ClipWriterStage`. Refer to [Save & Export](/curate-video/save-export) for directory layout and examples.


Example (place as the final stage):

```python
from nemo_curator.stages.video.io.clip_writer import ClipWriterStage

# `pipeline`, OUT_DIR, and VIDEO_DIR are assumed to be defined earlier in your script.
pipeline.add_stage(
    ClipWriterStage(
        output_path=OUT_DIR,
        input_path=VIDEO_DIR,
        upload_clips=True,
        dry_run=False,
        generate_embeddings=True,
        generate_previews=False,
        generate_captions=False,
        embedding_algorithm="cosmos-embed1-224p",
        caption_models=[],
        enhanced_caption_models=[],
        verbose=True,
    )
)
```

Path helpers are available to resolve common locations (such as `clips/`, `filtered_clips/`, `previews/`, `metas/v0/`, and `ce1_embd_parquet/`).
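
To illustrate the kind of layout those locations describe, here is a minimal sketch that joins the subdirectory names listed above onto an output root. The function name `resolve_output_dirs` and the assumption that every subdirectory sits directly under the output root are hypothetical. This is not NeMo Curator's path helper API, so refer to Save & Export for the actual layout.

```python
from pathlib import PurePosixPath

# Subdirectory names copied from the list above.
# Assumption: each one lives directly under the writer's output path.
SUBDIRS = (
    "clips",
    "filtered_clips",
    "previews",
    "metas/v0",
    "ce1_embd_parquet",
)


def resolve_output_dirs(output_path: str) -> dict[str, str]:
    """Hypothetical helper: map each subdirectory name to its full path."""
    root = PurePosixPath(output_path)
    return {sub: str(root / sub) for sub in SUBDIRS}


if __name__ == "__main__":
    # "/data/curated" is a placeholder output root for illustration only.
    for name, path in resolve_output_dirs("/data/curated").items():
        print(f"{name}: {path}")
```

A lookup like this can be handy when a later script needs to read the written clips or embeddings, but prefer the library's own helpers where they exist so that paths stay in sync with the writer.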