# Lance Training Script Usage

This document explains the purpose and common usage of the following training scripts:

- `scripts/sft_lance_unified.sh`
- `scripts/sft_lance_generation.sh`
- `scripts/sft_lance_understand.sh`

All three scripts ultimately launch `train/unified_train.py` via `accelerate launch`. The main differences are the default dataset config and the `VISUAL_GEN` switch.

## 1. Environment Setup

The training release adds extra dependencies on top of the inference-only setup. If you installed the environment before the fine-tuning code was released, reinstall the dependencies from the updated `requirements.txt`:

```bash
pip install -r requirements.txt
```

## 2. Data Preparation

Training reads local parquet files. Download example datasets from [Hugging Face](https://huggingface.co/datasets/bytedance-research/Lance_example_dataset) and place them under local `./datasets`.

Expected local layout:

```text
datasets/
├── image2image/
├── image2text/
├── text2image/
├── text2video/
├── video2text/
└── video2video/
```

The ready-to-use local configs live in `config/train_local/`, for example:

| Task | Config |
| --- | --- |
| `t2i` | `config/train_local/t2i_local.yaml` |
| `t2v` | `config/train_local/t2v_local.yaml` |
| `i2i` | `config/train_local/i2i_local.yaml` |
| `v2v` | `config/train_local/v2v_local.yaml` |
| `i2t` | `config/train_local/i2t_local.yaml` |
| `v2t` | `config/train_local/v2t_local.yaml` |

For parquet schemas, supported task types, and custom dataset construction, see [train_dataset.md](train_dataset.md).

## 3. Launch Patterns

After preparing the environment, model weights, and example datasets, launch one of the training scripts below.

By default, the scripts use all GPUs visible to `nvidia-smi` on the current machine. To run on fewer GPUs, set `ARNOLD_WORKER_GPU` before launching, for example `ARNOLD_WORKER_GPU=1 bash scripts/sft_lance_unified.sh`.
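
The GPU-count override described above can be pictured with a small shell sketch. This is an illustration and not the scripts' actual code: the function name `resolve_gpu_count` is hypothetical, and the sketch assumes the default is obtained by counting the devices listed by `nvidia-smi -L`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the GPU count a launch would use.
# Assumption: an explicit ARNOLD_WORKER_GPU wins; otherwise count the
# "GPU N: ..." lines printed by `nvidia-smi -L` (one per visible GPU).
resolve_gpu_count() {
  if [ -n "${ARNOLD_WORKER_GPU:-}" ]; then
    echo "${ARNOLD_WORKER_GPU}"
  elif command -v nvidia-smi >/dev/null 2>&1; then
    # grep -c prints 0 (and exits non-zero) when nothing matches.
    nvidia-smi -L 2>/dev/null | grep -c '^GPU ' || true
  else
    echo 0   # nvidia-smi is not installed on this machine
  fi
}

echo "GPUs for this run: $(resolve_gpu_count)"
```

Checking the resolved count before a long run is a cheap way to confirm that an override was actually picked up.
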
For smaller local machines, you can also lower dataloader workers, for example `NUM_WORKERS=2 ARNOLD_WORKER_GPU=1 bash scripts/sft_lance_unified.sh`.

### 3.1 Unified mixed training

```bash
bash scripts/sft_lance_unified.sh
```

### 3.2 Generation-task training

```bash
bash scripts/sft_lance_generation.sh
```

For a local `t2v` run, override the dataset config and experiment name:

```bash
DATASET_CONFIG_FILE=config/train_local/t2v_local.yaml \
VAL_DATASET_CONFIG_FILE=config/train_local/t2v_local.yaml \
WANDB_NAME=t2v_local_debug \
bash scripts/sft_lance_generation.sh
```

### 3.3 Understanding-task training

```bash
bash scripts/sft_lance_understand.sh
```

For a local `v2t` run, override the dataset config and experiment name:

```bash
DATASET_CONFIG_FILE=config/train_local/v2t_local.yaml \
VAL_DATASET_CONFIG_FILE=config/train_local/v2t_local.yaml \
WANDB_NAME=v2t_local_debug \
bash scripts/sft_lance_understand.sh
```

**NOTE**: The scripts default to `MODEL_PATH=./downloads/Lance_3B_Video`, the unified video-capable checkpoint. For image-only fine-tuning, such as `t2i`, `i2i`, or `i2t`, you can switch `MODEL_PATH` to `./downloads/Lance_3B` before launch.

## 4. Training Script Selection

These scripts expand shell variables into command-line arguments and pass them to `train/unified_train.py`. In practice, you should first decide which class of task you want to train.

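
The variable-expansion pattern mentioned above can be sketched as follows. This is a minimal illustration, not the contents of the real scripts: the function name `build_launch_cmd` and the flag names (`--dataset_config_file`, `--visual_gen`, and so on) are hypothetical, and the command is only printed, never executed. The defaults shown are the ones this document lists for `sft_lance_understand.sh`.

```bash
#!/usr/bin/env bash
# Keep a value exported by the caller; otherwise fall back to a default.
# Defaults below are those this document lists for sft_lance_understand.sh.
DATASET_CONFIG_FILE="${DATASET_CONFIG_FILE:-config/train_local/multi_und.yaml}"
VISUAL_UND="${VISUAL_UND:-True}"
VISUAL_GEN="${VISUAL_GEN:-False}"
VALIDATION_STEP="${VALIDATION_STEP:--1}"

# Hypothetical flag names, for illustration only; check
# train/unified_train.py for the arguments it really accepts.
build_launch_cmd() {
  echo "accelerate launch train/unified_train.py" \
    "--dataset_config_file ${DATASET_CONFIG_FILE}" \
    "--visual_und ${VISUAL_UND}" \
    "--visual_gen ${VISUAL_GEN}" \
    "--validation_step ${VALIDATION_STEP}"
}

# Print the command instead of running it, so the expansion can be inspected.
build_launch_cmd
```

With `DATASET_CONFIG_FILE=config/train_local/v2t_local.yaml` set in the environment, the same sketch would print the command with that path substituted, which mirrors the override style used in section 3.
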

| Script | Default config | Default switches | Suitable scenarios | Common fields to modify |
| --- | --- | --- | --- | --- |
| `scripts/sft_lance_unified.sh` | `config/train_local/unified.yaml` | `VISUAL_UND=True`, `VISUAL_GEN=True` | Mixed understanding + generation training | `DATASET_CONFIG_FILE`, `VAL_DATASET_CONFIG_FILE`, `WANDB_NAME` |
| `scripts/sft_lance_generation.sh` | `config/train_local/multi_gen.yaml` | `VISUAL_UND=True`, `VISUAL_GEN=True` | Generation tasks such as `t2i`, `t2v`, `i2i`, `v2v` | `DATASET_CONFIG_FILE`, `VAL_DATASET_CONFIG_FILE`, `WANDB_NAME` |
| `scripts/sft_lance_understand.sh` | `config/train_local/multi_und.yaml` | `VISUAL_UND=True`, `VISUAL_GEN=False` | Understanding tasks such as `i2t`, `v2t` | `DATASET_CONFIG_FILE`, `VAL_DATASET_CONFIG_FILE`, `WANDB_NAME` |

## 5. Key Parameters to Modify First

These are the parameters you should verify before most runs. Think of them as the first layer of knobs to adjust.

| Parameter | Purpose | When to change | Recommendation |
| --- | --- | --- | --- |
| `DATASET_CONFIG_FILE` | Specifies the training dataset yaml | Almost every run | Point it to the dataset config you actually want to train |
| `VAL_DATASET_CONFIG_FILE` | Specifies the validation dataset yaml | Validation is currently not supported | Keep the default value |
| `WANDB_NAME` | Names the experiment | Almost every run | Include task name, dataset name, and date |
| `VISUAL_UND` | Enables the visual understanding branch | Rarely changed | Keep `True` for understanding tasks and most generation tasks |
| `VISUAL_GEN` | Enables the visual generation branch | Must be checked when switching between understanding and generation | Set `False` for understanding tasks, `True` for generation tasks |
| `SAVE_EVERY` | Checkpoint save interval | Commonly changed in both debug and formal runs | Smaller for debugging, larger for long runs |
| `CKPT_DEBUG_STEPS` | Very early debug checkpoint | Commonly changed during debugging | Set to `-1` if you do not need early debug checkpoints |
| `VALIDATION_STEP` | Validation interval | Validation is currently not supported | Keep `-1`; do not set it to a positive integer |
| `NUM_SHARD` | Number of FSDP shards | When changing the parallelism strategy | Tune together with GPU count and memory budget |
| `NUM_REPLICATE` | Number of replicas | Usually changes with `NUM_SHARD` | Computed as `TOTAL_RANK / NUM_SHARD` |

## 6. Two Switches That Are Easy to Misconfigure

### 6.1 `VISUAL_GEN`

`VISUAL_GEN` controls whether the visual generation branch is enabled, including the VAE latent / flow matching / MSE path.

- Common settings for generation tasks:
  - `VISUAL_UND=True`
  - `VISUAL_GEN=True`
- Common settings for understanding tasks:
  - `VISUAL_UND=True`
  - `VISUAL_GEN=False`

If you accidentally set `VISUAL_GEN=True` for an understanding task, but the batch does not contain the latent fields required by the generation branch, `Lance.forward(...)` may enter the wrong branch and fail.

### 6.2 `VALIDATION_STEP`

All three scripts default to:

```bash
VALIDATION_STEP=-1
```

This means:

- no fixed validation dataset is prepared
- `validate_on_fixed_batch(...)` is never triggered in the training loop

The validation logic in the training script has not been fully verified yet. Enabling validation with a positive value is currently not supported, so do not set values such as `VALIDATION_STEP=100`; keep it as `-1`.

## 7. Practical Recommendations

1. Use `sft_lance_understand.sh` first for pure understanding tasks.
2. Use `sft_lance_generation.sh` first for pure generation tasks.
3. Use `sft_lance_unified.sh` only when you actually need mixed-task training.
4. During debugging, prioritize changing:
   - `DATASET_CONFIG_FILE`
   - `WANDB_NAME`
   - `VISUAL_GEN`
   - `SAVE_EVERY`
   - `CKPT_DEBUG_STEPS`
   - `VALIDATION_STEP`
5. For local parquet training, verify first:
   - the yaml really points to a local parquet file
   - the `_local` dataset class matches the parquet schema
   - understanding tasks do not accidentally run with `VISUAL_GEN=True`
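
Two of the local-parquet checks above can be partly automated before launch. The sketch below is a hypothetical helper, not part of the repository: the function name `preflight_check` is invented for illustration, and it assumes the dataset yaml lists parquet files as literal path strings ending in `.parquet`, which may not match the real config layout. It does not check the `_local` dataset class against the parquet schema.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight helper (not part of the repository).
# Usage: preflight_check <dataset.yaml> <und|gen>
preflight_check() {
  config="$1"; task_kind="$2"; status=0

  if [ ! -f "$config" ]; then
    echo "missing dataset config: $config"
    return 1
  fi

  # Assumption: parquet files appear in the yaml as literal paths
  # ending in .parquet (optionally wrapped in quotes).
  paths="$(grep -Eo "[^[:space:]\"']+\.parquet" "$config" || true)"
  if [ -z "$paths" ]; then
    echo "no .parquet path found in $config"
    status=1
  fi

  # Unquoted on purpose: splits the list on whitespace, and lets the
  # shell expand a glob if the yaml uses one.
  for p in $paths; do
    [ -f "$p" ] || { echo "parquet file not found: $p"; status=1; }
  done

  # Understanding tasks should not run with the generation branch enabled.
  if [ "$task_kind" = "und" ] && [ "${VISUAL_GEN:-False}" = "True" ]; then
    echo "VISUAL_GEN=True is set for an understanding task"
    status=1
  fi

  return "$status"
}
```

For example, `preflight_check config/train_local/v2t_local.yaml und && bash scripts/sft_lance_understand.sh` would launch training only if every check passes.
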