# JustDubit Pipeline

A two-stage audio-video generation pipeline for **video dubbing** tasks, built on Lightricks' LTX-2 model. JustDubit generates synchronized audio and video from source content, enabling high-quality dubbing with natural lip movements and speech alignment.

> For other LTX-2 pipelines (text-to-video, image-to-video, keyframe interpolation), see the [main LTX-2 repository](https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-pipelines/).

---

## 📋 Overview

**Key Features:**

- 🎙️ **Synchronized Audio-Video Generation**: Generates aligned audio and video for dubbing tasks
- 🎬 **Two-Stage Architecture**: Stage 1 generates at a base resolution, Stage 2 upsamples 2x with refinement
- 📹 **Video Conditioning**: Condition on a source video for lip-sync and motion preservation
- 🖼️ **Automatic First-Frame Extraction**: Seamlessly extracts conditioning frames from the source video
- 🔧 **LoRA Support**: Easy integration with custom LoRA adapters

---

## 🚀 Quick Start

### Installation

```bash
# From the repository root
uv sync --frozen
```

### Usage

```bash
# Model paths
CHECKPOINT=/path/to/ltx-av-checkpoint.safetensors
GEMMA_ROOT=/path/to/gemma-text-encoder
SPATIAL_UPSAMPLER=/path/to/spatial-upscaler.safetensors
DISTILLED_LORA=/path/to/distilled-lora.safetensors

uv run python src/ltx_pipelines/pipeline_justdubit.py \
    --checkpoint_path ${CHECKPOINT} \
    --gemma_root ${GEMMA_ROOT} \
    --distilled_lora_path ${DISTILLED_LORA} \
    --distilled_lora_strength 1.0 \
    --spatial_upsampler_path ${SPATIAL_UPSAMPLER} \
    --lora /path/to/justdubit-lora.safetensors \
    --lora_strength 1.0 \
    --video_conditioning /path/to/source-video.mp4 1.0 \
    --prompt "The man is speaking English, saying: 'Hello, world!'" \
    --height 512 --width 768 \
    --num_inference_steps 30 \
    --cfg_guidance_scale 3.0 \
    --frame_rate 25 \
    --seed 42 \
    --output_path ./output.mp4
```

> **Resolution Note:** `--height` and `--width` specify the Stage 1 resolution. The final output is **2x upsampled** in Stage 2. For example, a `512x768` Stage 1 resolution produces a `1024x1536` output video.
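The command above assumes the model files are already on disk. As a minimal sketch, they can be fetched with `huggingface_hub`, using the repo IDs and filenames from the download links in this README (the `models/` directory layout is an arbitrary choice):

```python
# Minimal sketch: fetch the required checkpoints with huggingface_hub.
# Repo IDs and filenames come from the download links in this README;
# the local directory layout ("models/") is an arbitrary choice.
from huggingface_hub import hf_hub_download, snapshot_download

hf_hub_download("Lightricks/LTX-2", "ltx-2-19b-dev.safetensors", local_dir="models")
hf_hub_download("Lightricks/LTX-2", "ltx-2-19b-distilled-lora-384.safetensors", local_dir="models")
hf_hub_download("Lightricks/LTX-2", "ltx-2-spatial-upscaler-x2-1.0.safetensors", local_dir="models")
hf_hub_download("justdubit/justdubit", "ltx-2-19b-ic-lora-lipdubbing.safetensors", local_dir="models")

# The Gemma text encoder is a full directory, so mirror the whole repo.
snapshot_download("google/gemma-3-12b-it-qat-q4_0-unquantized", local_dir="models/gemma")
```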
---

## 🎛️ CLI Arguments

| Argument | Required | Description |
|----------|----------|-------------|
| `--checkpoint_path` | ✅ | Path to LTX-2 AV model checkpoint - [Download](https://huggingface.co/Lightricks/LTX-2/resolve/main/ltx-2-19b-dev.safetensors) |
| `--gemma_root` | ✅ | Path to Gemma text encoder directory - [Download](https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-unquantized/tree/main) |
| `--distilled_lora_path` | ✅ | Path to distilled LoRA for Stage 2 - [Download](https://huggingface.co/Lightricks/LTX-2/resolve/main/ltx-2-19b-distilled-lora-384.safetensors) |
| `--spatial_upsampler_path` | ✅ | Path to spatial upsampler model - [Download](https://huggingface.co/Lightricks/LTX-2/resolve/main/ltx-2-spatial-upscaler-x2-1.0.safetensors) |
| `--output_path` | ✅ | Path for output MP4 file |
| `--prompt` | ✅ | Text prompt describing the desired output |
| `--video_conditioning` | ✅ | Source video path and strength (e.g., `video.mp4 1.0`) |
| `--lora` | ✅ | JustDubit LoRA path (use with `--lora_strength`) - [Download](https://huggingface.co/justdubit/justdubit/resolve/main/ltx-2-19b-ic-lora-lipdubbing.safetensors) |
| `--distilled_lora_strength` | | Strength of distilled LoRA (default: 1.0) |
| `--lora_strength` | | Strength of custom LoRA (default: 1.0) |
| `--negative_prompt` | | Negative prompt for CFG guidance |
| `--height` | | Stage 1 video height in pixels (default: 512, final output: 1024) |
| `--width` | | Stage 1 video width in pixels (default: 768, final output: 1536) |
| `--num_inference_steps` | | Number of denoising steps (default: 30) |
| `--cfg_guidance_scale` | | CFG guidance scale (default: 3.0) |
| `--frame_rate` | | Output frame rate in fps (default: 25) |
| `--seed` | | Random seed for reproducibility |

> **Note:** The final output resolution is **2x** the specified height/width due to Stage 2 upsampling. For example, `--height 512 --width 768` produces a **1024x1536** output video.

---

## 📝 Prompt Format

Prompts for JustDubit should follow a specific structure to achieve best results:

```
[Speaker] is speaking [Language/Accent], saying: "[Dialogue]"
```

### Components

| Component | Description | Examples |
|-----------|-------------|----------|
| **Speaker** | Description of who is speaking | "The girl", "The man", "The young woman", "The elderly man" |
| **Language/Accent** | The language and accent style | "standard English", "British English", "American English", "French" |
| **Dialogue** | The actual words to be spoken (in quotes) | "Bonjour!", "I'm allowed to do whatever I want." |

### Examples

```text
A young woman is speaking American English, saying: "This is so exciting! I can't wait to get started."

The elderly gentleman is speaking French, saying: "Bonjour, comment allez-vous aujourd'hui?"
```

### Tips

- **Be specific about the speaker**: Match the speaker description to the person visible in the source video
- **Specify the accent**: Adding accent details (e.g., "British English", "American English") helps generate more natural speech
- **Use natural punctuation**: Include commas, periods, and ellipses in the dialogue for natural pacing

### Translating Dialogues with LLMs

For dubbing videos into different languages, we recommend using an LLM to translate the original dialogue. Use the following system prompt for best results:

```
You are a professional translator. Your task is to translate the given dialogue into the target language while preserving the original meaning, tone, and linguistic style as closely as possible.

Guidelines:
- Keep the intent and nuance of each sentence intact.
- Preserve formality or informality, emphasis, repetition, and emotional intensity using linguistic means only.
- Translate idioms and expressions naturally into the target language rather than literally, when appropriate.
- Maintain similar sentence structure and flow where possible.
- Do not add, omit, or reinterpret content.
- Output only the translated text, keeping any original formatting or speaker labels.
```

Then construct your JustDubit prompt with the translated dialogue:

```
The woman is speaking French, saying: "[translated dialogue here]"
```
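To make this concrete, here is a hedged sketch of driving that system prompt through the OpenAI Python client. The client, model name, and user-message format are assumptions; any chat-capable LLM can be substituted:

```python
# Hypothetical sketch: translate a dialogue line with an LLM, then build the
# JustDubit prompt from it. The OpenAI client, model name, and user-message
# format are assumptions -- any chat-capable LLM works the same way.
from openai import OpenAI

TRANSLATOR_SYSTEM_PROMPT = (
    "You are a professional translator. Your task is to translate the given "
    "dialogue into the target language while preserving the original meaning, "
    "tone, and linguistic style as closely as possible."
    # ...plus the guideline bullets from the system prompt above.
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def translate_dialogue(dialogue: str, target_language: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any chat-capable model works
        messages=[
            {"role": "system", "content": TRANSLATOR_SYSTEM_PROMPT},
            # Assumption: the system prompt leaves the target language to be
            # named in the user message.
            {"role": "user", "content": f"Target language: {target_language}\n\n{dialogue}"},
        ],
    )
    return response.choices[0].message.content.strip()


translated = translate_dialogue("I'm allowed to do whatever I want.", "French")
prompt = f'The woman is speaking French, saying: "{translated}"'
print(prompt)
```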
---

## 🐍 Python API

```python
from ltx_pipelines.pipeline_justdubit import JustDubitPipeline
from ltx_core.loader import LoraPathStrengthAndSDOps, LTXV_LORA_COMFY_RENAMING_MAP
from ltx_core.model.video_vae import TilingConfig
from ltx_pipelines.utils.media_io import encode_video
from ltx_pipelines.utils.constants import AUDIO_SAMPLE_RATE

# Initialize pipeline
pipeline = JustDubitPipeline(
    checkpoint_path="/path/to/checkpoint.safetensors",
    distilled_lora_path="/path/to/distilled-lora.safetensors",
    distilled_lora_strength=1.0,
    spatial_upsampler_path="/path/to/upsampler.safetensors",
    gemma_root="/path/to/gemma",
    loras=[
        LoraPathStrengthAndSDOps(
            "/path/to/justdubit-lora.safetensors",
            1.0,
            LTXV_LORA_COMFY_RENAMING_MAP,
        )
    ],
)

# Run inference
# Note: height/width set the Stage 1 resolution; the final output is 2x (1024x1536)
video, audio = pipeline(
    prompt="The man is speaking English, saying: 'Hello, world!'",
    negative_prompt="blurry, low quality, distorted",
    seed=42,
    height=512,  # Stage 1 height (final output: 1024)
    width=768,   # Stage 1 width (final output: 1536)
    num_frames=121,
    frame_rate=25,
    num_inference_steps=30,
    cfg_guidance_scale=3.0,
    images=[],  # Auto-extracted from video conditioning
    video_conditioning=[("/path/to/source-video.mp4", 1.0)],
    tiling_config=TilingConfig.default(),
)

# Save output
encode_video(
    video=video,
    fps=25,
    audio=audio,
    audio_sample_rate=AUDIO_SAMPLE_RATE,
    output_path="./output.mp4",
)
```

---

## 🏗️ Pipeline Architecture

JustDubit follows a two-stage architecture optimized for video dubbing:

### Stage 1: Generation

1. **Text Encoding**: Prompt encoded via Gemma into video and audio context embeddings
2. **Conditioning Setup**:
   - First frame auto-extracted from the source video for image conditioning (sketched after this section)
   - Audio extracted from the source video for audio conditioning
   - Video frames encoded as latent conditionings
3. **Diffusion Process**:
   - CFG-guided denoising for the specified number of steps
   - Simultaneous video and audio latent generation

### Stage 2: Refinement

1. **Upsampling**: Video latent upsampled 2x using the spatial upsampler
2. **Distilled Refinement**: Fast high-resolution refinement with the distilled LoRA
3. **Audio Passthrough**: Audio latent used as conditioning (no re-denoising)

### Decoding

1. **Video Decoding**: VAE decodes the video latent to pixel space
2. **Audio Decoding**: Audio VAE + vocoder decode the audio latent to a waveform
3. **Output**: Synchronized MP4 with video and audio tracks
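For intuition, the first-frame extraction in Stage 1's conditioning setup is roughly equivalent to this standalone OpenCV snippet. It is illustrative only, not the pipeline's internal code; JustDubit performs the extraction automatically from `--video_conditioning`:

```python
# Illustrative sketch of Stage 1's first-frame extraction, written with
# OpenCV for clarity. NOT the pipeline's internal code -- JustDubit extracts
# the conditioning frame automatically from the --video_conditioning input.
import cv2


def extract_first_frame(video_path: str, image_path: str) -> None:
    capture = cv2.VideoCapture(video_path)
    try:
        ok, frame = capture.read()  # decode the first frame (BGR)
        if not ok:
            raise RuntimeError(f"Could not read a frame from {video_path}")
        cv2.imwrite(image_path, frame)  # imwrite expects BGR, so no conversion needed
    finally:
        capture.release()


extract_first_frame("/path/to/source-video.mp4", "./first_frame.png")
```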
---

## 🔧 Required Model Checkpoints

Download the required model files from Hugging Face:

| Model | Description | Download |
|-------|-------------|----------|
| **LTX-2 AV Checkpoint** | Main model checkpoint | [`ltx-2-19b-dev.safetensors`](https://huggingface.co/Lightricks/LTX-2/resolve/main/ltx-2-19b-dev.safetensors) |
| **JustDubit LoRA** 💋 | LoRA for lip-sync dubbing | [`ltx-2-19b-ic-lora-lipdubbing.safetensors`](https://huggingface.co/justdubit/justdubit/resolve/main/ltx-2-19b-ic-lora-lipdubbing.safetensors) |
| **Distilled LoRA** | Required for Stage 2 refinement | [`ltx-2-19b-distilled-lora-384.safetensors`](https://huggingface.co/Lightricks/LTX-2/resolve/main/ltx-2-19b-distilled-lora-384.safetensors) |
| **Spatial Upscaler** | Required for 2x upsampling | [`ltx-2-spatial-upscaler-x2-1.0.safetensors`](https://huggingface.co/Lightricks/LTX-2/resolve/main/ltx-2-spatial-upscaler-x2-1.0.safetensors) |
| **Gemma Text Encoder** | Text encoder (download all assets) | [`gemma-3-12b-it-qat-q4_0-unquantized`](https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-unquantized/tree/main) |

---

## 🔗 Related

- **[Official LTX-2 Pipelines](https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-pipelines/)** - Other pipelines (text-to-video, image-to-video, keyframe interpolation)
- **[Train your own JustDubit](../ltx-trainer/README.md)** - Training tools