# JustDubit Training

Training tools for the **JustDubit** pipeline, built on Lightricks' LTX-2. JustDubit training enables fine-tuning the LTX-2 model for synchronized audio-video generation, where a reference video/audio conditions the generation of the dubbed output with natural lip movements.

> For other training strategies (text-to-video, image-to-video, IC-LoRA), see the [main LTX-2 Trainer](https://github.com/Lightricks/LTX-2/tree/main/packages/ltx-trainer).

---

## 📋 Overview

**Training Features:**

- 🎙️ **Joint Audio-Video Training**: Train synchronized audio and video generation
- 📹 **Reference Conditioning**: Use reference video/audio to guide generation
- 🎭 **Masked Loss**: Focus training on lip/face regions with segmentation masks
- 🔧 **LoRA Training**: Efficient fine-tuning with Low-Rank Adaptation
- ⚡ **Cross-Attention Masking**: Improved audio-video synchronization

---

## 🚀 Quick Start

### 1. Installation

```bash
# From the repository root
uv sync --frozen
```

### 2. Download Dataset

Download the JustDubit training dataset from Hugging Face:

```bash
# Using huggingface-cli
huggingface-cli download justdubit/audiovisual_translation_dub --local-dir ./data/audiovisual_translation_dub --repo-type dataset

# Or using git lfs
git lfs install
git clone https://huggingface.co/datasets/justdubit/audiovisual_translation_dub ./data/audiovisual_translation_dub
```

The dataset contains paired video samples with:

- Target videos (dubbed output)
- Reference videos (original source)
- Face segmentation masks (optional, for masked loss training)
- Text captions with speaker and dialogue information

### 3. Preprocess Dataset

Preprocess the dataset to compute latent representations:

```bash
uv run python scripts/process_dataset.py ./data/audiovisual_translation_dub/dataset.json \
  --resolution-buckets "640x352x121" \
  --model-path /path/to/ltx-2-checkpoint.safetensors \
  --text-encoder-path /path/to/gemma-text-encoder \
  --with-audio \
  --reference-column reference_path \
  --mask-column mask_path  # Optional: only needed for masked loss training
```

**Preprocessing Arguments:**

| Argument | Description |
|----------|-------------|
| `--resolution-buckets` | Video dimensions as `WxHxF` (width x height x frames) |
| `--model-path` | Path to the LTX-2 checkpoint |
| `--text-encoder-path` | Path to the Gemma text encoder directory |
| `--with-audio` | Enable audio latent extraction |
| `--reference-column` | Dataset column containing reference video paths |
| `--mask-column` | (Optional) Dataset column containing face mask paths for masked loss |
| `--decode` | (Optional) Decode latents for verification |

This creates a `.precomputed` directory with:

```
data/audiovisual_translation_dub/
└── .precomputed/
    ├── latents/                   # Target video latents
    ├── conditions/                # Text embeddings
    ├── audio_latents/             # Target audio latents
    ├── reference_latents/         # Reference video latents
    ├── reference_audio_latents/   # Reference audio latents
    └── masks/                     # Face segmentation masks (if --mask-column provided)
```
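Before moving on to training, it can help to confirm that preprocessing actually populated every expected subdirectory. The following is a minimal sketch, not part of the trainer: it assumes only the directory layout shown above, and the file naming inside each subdirectory is an implementation detail of `scripts/process_dataset.py` that it does not check.

```python
# Sanity-check the .precomputed output (illustrative sketch only).
from pathlib import Path

PRECOMPUTED = Path("./data/audiovisual_translation_dub/.precomputed")

# "masks" is only present when --mask-column was passed during preprocessing.
required = ["latents", "conditions", "audio_latents",
            "reference_latents", "reference_audio_latents"]

for name in required:
    files = list((PRECOMPUTED / name).glob("*"))
    assert files, f"{name}/ is missing or empty - re-run preprocessing"
    print(f"{name}: {len(files)} files")

masks_dir = PRECOMPUTED / "masks"
if masks_dir.exists():
    print(f"masks: {len(list(masks_dir.glob('*')))} files (masked loss available)")
else:
    print("masks: not found (train with use_masked_loss: false)")
```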
### 4. Configure Training

Create a training configuration file `configs/justdubit.yaml`:

```yaml
# Model Configuration
model:
  model_path: "/path/to/ltx-2-checkpoint.safetensors"
  text_encoder_path: "/path/to/gemma-text-encoder"
  training_mode: "lora"

# LoRA Configuration
lora:
  rank: 128
  alpha: 128
  dropout: 0.0
  target_modules:
    - "to_k"
    - "to_q"
    - "to_v"
    - "to_out.0"
    - "net.0.proj"
    - "net.2"

# Training Strategy
training_strategy:
  name: "justdubit"
  first_frame_conditioning_p: 0.1
  with_audio: true
  enable_cross_attention_masking: true
  audio_latents_dir: "audio_latents"
  reference_latents_dir: "reference_latents"
  reference_audio_latents_dir: "reference_audio_latents"

  # Masked loss configuration (optional - for focusing training on face regions)
  # Set use_masked_loss to false if not using face masks
  mask_config:
    mask_dir: "masks"
    use_masked_loss: true  # Set to false if not using masks
    mask_loss_weight: 1.0
    background_loss_weight: 0.1
    mask_threshold: 0.1

# Optimization
optimization:
  learning_rate: 2e-4
  learning_rate_groups:
    "audio_attn": 1e-6
    "audio_ff": 1e-6
  steps: 2000
  batch_size: 1
  gradient_accumulation_steps: 1
  max_grad_norm: 1.0
  optimizer_type: "adamw"
  scheduler_type: "linear"
  enable_gradient_checkpointing: true

# Acceleration
acceleration:
  mixed_precision_mode: "bf16"
  quantization: null

# Data
data:
  preprocessed_data_root: "./data/audiovisual_translation_dub/.precomputed"
  num_dataloader_workers: 2

# Validation
validation:
  prompts:
    - "A woman is speaking English, saying: 'Hello, how are you today?'"
  reference_videos:
    - "./data/audiovisual_translation_dub/validation/sample.mp4"
  video_dims: [640, 352, 121]
  frame_rate: 25.0
  seed: 42
  inference_steps: 30
  interval: 100
  guidance_scale: 3.0
  generate_audio: true

# Checkpoints
checkpoints:
  interval: 250
  keep_last_n: -1

# Flow Matching
flow_matching:
  timestep_sampling_mode: "shifted_logit_normal"

# Output
seed: 42
output_dir: "outputs/justdubit"
```
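Before launching a run, a quick pre-flight check of the config can catch path mistakes early. The snippet below is a sketch rather than part of the trainer (which does its own validation); it assumes PyYAML is installed and only mirrors the fields defined in the config above.

```python
# Pre-flight check for configs/justdubit.yaml (illustrative sketch, assumes PyYAML).
from pathlib import Path
import yaml

cfg = yaml.safe_load(Path("configs/justdubit.yaml").read_text())

# Paths that must exist before launching scripts/train.py.
paths = {
    "model.model_path": cfg["model"]["model_path"],
    "model.text_encoder_path": cfg["model"]["text_encoder_path"],
    "data.preprocessed_data_root": cfg["data"]["preprocessed_data_root"],
}
for key, value in paths.items():
    print(f"{key}: {value} -> {'OK' if Path(value).exists() else 'MISSING'}")

# If masked loss is enabled, the masks directory must have been precomputed.
mask_cfg = cfg["training_strategy"].get("mask_config", {})
if mask_cfg.get("use_masked_loss"):
    mask_dir = Path(cfg["data"]["preprocessed_data_root"]) / mask_cfg.get("mask_dir", "masks")
    status = "OK" if mask_dir.exists() else "MISSING (set use_masked_loss: false)"
    print(f"masks: {mask_dir} -> {status}")
```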
### 5. Train

Run training on a single GPU:

```bash
uv run python scripts/train.py configs/justdubit.yaml
```

For multi-GPU training with Accelerate:

```bash
accelerate config  # Configure distributed training
accelerate launch scripts/train.py configs/justdubit.yaml
```

---

## 🎛️ Key Configuration Options

### Training Strategy

| Option | Description |
|--------|-------------|
| `name` | Set to `"justdubit"` for video dubbing training |
| `first_frame_conditioning_p` | Probability of first-frame conditioning (default: 0.1) |
| `with_audio` | Enable joint audio-video training (default: true) |
| `enable_cross_attention_masking` | Mask cross-attention between reference tokens and noisy tokens |

### Masked Loss

| Option | Description |
|--------|-------------|
| `use_masked_loss` | Enable masked loss computation |
| `mask_loss_weight` | Weight for foreground (face) regions |
| `background_loss_weight` | Weight for background regions |
| `mask_threshold` | Binary mask threshold |

### Learning Rate Groups

For JustDubit, we recommend using lower learning rates for the audio modules:

```yaml
learning_rate_groups:
  "audio_attn": 1e-6  # Audio attention layers
  "audio_ff": 1e-6    # Audio feed-forward layers
```

---

## 📁 Dataset Format

The dataset must be a JSON/JSONL file with the following structure:

```json
[
  {
    "caption": "The woman is speaking English, saying: 'Hello, world!'",
    "media_path": "target_001.mp4",
    "reference_path": "reference_001.mp4",
    "mask_path": "facemask_001.mp4"
  }
]
```

| Field | Required | Description |
|-------|----------|-------------|
| `caption` | ✅ | Text prompt with speaker, language, and dialogue |
| `media_path` | ✅ | Path to the target (dubbed) video |
| `reference_path` | ✅ | Path to the reference (source) video |
| `mask_path` | | (Optional) Path to a face segmentation mask video for masked loss |

> **Note:** The `mask_path` field is only required if you want to train with masked loss (`use_masked_loss: true`), which focuses the loss computation on facial regions. If not using masked loss, you can omit this field and set `use_masked_loss: false` in your training config.

---

## 🔧 Requirements

- **Python 3.12+**
- **Linux with CUDA 12.1+**
- **NVIDIA GPU with 80GB+ VRAM** (recommended)
- **Model Checkpoints**:
  - LTX-2 AV checkpoint - [Download](https://huggingface.co/Lightricks/LTX-2/resolve/main/ltx-2-19b-dev.safetensors)
  - Gemma text encoder - [Download](https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-unquantized/tree/main)

---

## 🔗 Related

- **[JustDubit Pipeline](../ltx-pipelines/README.md)** - Inference pipeline for video dubbing
- **[LTX-2 Trainer](https://github.com/Lightricks/LTX-2/tree/main/packages/ltx-trainer)** - Other training strategies