# Lance: Unified Multimodal Modeling by Multi-Task Synergy

Fengyi Fu<sup>*</sup>, Mengqi Huang<sup>*,✉</sup>, Shaojin Wu<sup>*</sup>, Yunsheng Jiang<sup>*</sup>, Yufei Huo, Jianzhu Guo<sup>✉,§</sup>,
Hao Li, Yinghang Song, Fei Ding, Qian He, Zheren Fu, Zhendong Mao, Yongdong Zhang

ByteDance

<sup>*</sup> Equal contribution, <sup>✉</sup> Corresponding authors, <sup>§</sup> Project lead


> **Note:** Lance is a research project rather than a polished product model. The released checkpoint was trained with up to 128 A100 GPUs, with training conducted up to 768x768 image generation and 480p, 12 FPS video generation. Our goal is to share a research artifact for studying unified image/video understanding, generation, and editing under a relatively small model and limited compute budget. Output quality may vary across prompts, resolutions, duration, motion complexity, and editing scenarios, and we see further opportunities to improve the post-training recipe. We appreciate constructive feedback from the community as we continue improving the project.

## 🔥 Updates

- **`2026/06/03`**: 🚀 Lance is now supported in [vLLM-Omni](https://github.com/vllm-project/vllm-omni). See the [recipe](https://github.com/vllm-project/vllm-omni/blob/main/recipes/ByteDance/Lance.md)!
- **`2026/05/29`**: 💪 Added support for Image-to-Video generation. [More to see](assets/docs/changelog/2026-05-29.md)!
- **`2026/05/26`**: 🎨 The Gradio interface now supports image and video generation, editing, and understanding. [Try it out](assets/docs/changelog/2026-05-26.md)!
- **`2026/05/25`**: ✨ The [Hugging Face Space](https://huggingface.co/spaces/bytedance-research/Lance) is now live, thanks to the HF team!
- **`2026/05/19`**: 🤗 The technical report is now available on [arXiv](http://arxiv.org/abs/2605.18678).
- **`2026/05/18`**: 🔥 We launched the [project homepage](https://lance-project.github.io/) and released the initial inference code and model weights on [GitHub](https://github.com/bytedance/Lance/) and [Hugging Face](https://huggingface.co/bytedance-research/Lance).

## 🌟 Highlights

**Lance** is a 3B native unified multimodal model that supports **image and video understanding, generation, and editing** within a single framework.

- **Efficient at 3B scale.** With only **3B active parameters**, Lance achieves competitive performance across image generation, image editing, and video generation benchmarks.
- **Training from scratch.** Lance is trained from scratch with a staged multi-task recipe and within a budget of **up to 128 A100 GPUs**.

We are actively updating and improving this repository. If you find any bugs or have suggestions, please feel free to open an issue or submit a pull request (PR) 💖.

Lance benchmark overview across image generation, image editing, video generation, and video understanding

## 📅 Roadmap

- [ ] Release the fine-tuning code.

## 🎨 Demo


🔥 We recommend visiting our homepage for more visual results. 🔥

Demo categories: Text-to-Video, Video Editing, Multi-turn Consistency Editing, Intelligent Video Generation.

## 🚀 Installation

### Recommended Environment

- **Software:** Python 3.10+, CUDA 12.4+ (required)
- **Hardware:** A GPU with at least 40GB VRAM is required for inference

We have tested the following dependency combinations on NVIDIA A100:

- PyTorch 2.8.0 + cu126 + flash-attn 2.8.3
- PyTorch 2.5.1 + cu124 + flash-attn 2.6.3

The default installation commands use the PyTorch 2.8.0 + cu126 setup. For other GPU models, please choose and validate a PyTorch build and a matching `flash-attn` version according to your driver, CUDA runtime, Python version, and GPU architecture.

### Installation Steps

First, clone the repository:

```bash
git clone https://github.com/bytedance/Lance.git
cd Lance
```

Then, set up the environment:

```bash
conda create -n Lance python=3.11 -y
conda activate Lance
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt
pip install flash-attn==2.8.3 --no-build-isolation
```

> **Note:** If installing `flash-attn` from source fails, you can install a prebuilt wheel instead. The wheelhouse below is from a third-party repository and is provided for **reference only**; please verify that any wheel you install matches your Python, PyTorch and CUDA versions.
> ```bash
> pip install --no-cache-dir --no-deps --force-reinstall \
>   "https://huggingface.co/strangertoolshf/flash_attention_2_wheelhouse/resolve/main/wheelhouse-flash_attn-2.8.3/linux_x86_64/torch2.8/cu12/abiTRUE/cp311/flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp311-cp311-linux_x86_64.whl"
> ```

Then, download the model weights from [Lance-3B on Hugging Face](https://huggingface.co/bytedance-research/Lance) and place them in the `downloads/` directory:

```python
import os

from huggingface_hub import snapshot_download

save_dir = "./downloads"
repo_id = "bytedance-research/Lance"
cache_dir = os.path.join(save_dir, "cache")

# Download only the weight, config, and code files needed for inference.
snapshot_download(
    cache_dir=cache_dir,
    local_dir=save_dir,
    repo_id=repo_id,
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt", "*.pth"],
)
```

## 📚 Usage

### Inference

#### Basic Usage

```bash
bash inference_lance.sh
```

- Before running, please configure the inference parameters at the top of `inference_lance.sh`.
- **Supported tasks:** `t2i`, `t2v`, `i2v`, `image_edit`, `video_edit`, `x2t_image`, and `x2t_video`. You can modify `TASK_DEFAULT_CONFIGS` in `inference_lance.py` to customize the default data samples for each task.
- **Note:** For all tasks, we recommend following the `prompt` format used in the provided examples when writing input prompts, as this typically leads to better generation quality.

#### Task Examples

##### Text-to-Video

```bash
bash inference_lance.sh \
  --TASK_NAME t2v \
  --MODEL_PATH downloads/Lance_3B_Video \
  --RESOLUTION video_480p \
  --NUM_FRAMES 121 \
  --VIDEO_HEIGHT 480 \
  --VIDEO_WIDTH 848 \
  --SAVE_PATH_GEN results/t2v
```

##### Image-to-Video

```bash
bash inference_lance.sh \
  --TASK_NAME i2v \
  --MODEL_PATH downloads/Lance_3B_Video \
  --RESOLUTION video_480p \
  --NUM_FRAMES 61 \
  --VIDEO_HEIGHT 480 \
  --VIDEO_WIDTH 848 \
  --SAVE_PATH_GEN results/i2v
```

Optional parameters for video generation task examples:

- `--ENHANCE_PROMPT true`: enable prompt rewrite for T2V/I2V.
  Prompt enhancement generally improves generation quality. Before enabling it, set `API_KEY`, `MODEL_NAME`, and `client` in `common/utils/caption_rewrite.py`. If no API key is configured there, prompt rewrite is skipped; in that case, we recommend **writing prompts in the style of the provided examples**.

##### Text-to-Image

```bash
bash inference_lance.sh \
  --TASK_NAME t2i \
  --MODEL_PATH downloads/Lance_3B \
  --RESOLUTION image_768res \
  --VIDEO_HEIGHT 768 \
  --VIDEO_WIDTH 768 \
  --SAVE_PATH_GEN results/t2i
```

##### Video Editing

```bash
bash inference_lance.sh \
  --TASK_NAME video_edit \
  --MODEL_PATH downloads/Lance_3B_Video \
  --RESOLUTION video_480p \
  --SAVE_PATH_GEN results/video_edit
```

##### Image Editing

```bash
bash inference_lance.sh \
  --TASK_NAME image_edit \
  --MODEL_PATH downloads/Lance_3B \
  --RESOLUTION image_768res \
  --SAVE_PATH_GEN results/image_edit
```

##### Video Understanding

```bash
bash inference_lance.sh \
  --TASK_NAME x2t_video \
  --MODEL_PATH downloads/Lance_3B_Video \
  --RESOLUTION video_480p \
  --NUM_FRAMES 50 \
  --SAVE_PATH_GEN results/x2t_video
```

##### Image Understanding

```bash
bash inference_lance.sh \
  --TASK_NAME x2t_image \
  --MODEL_PATH downloads/Lance_3B \
  --RESOLUTION image_768res \
  --SAVE_PATH_GEN results/x2t_image
```

Optional parameters for all task examples:

- `--CONFIG_PATH path/to/config.json`: use a custom validation JSON/JSONL file instead of the task default example config.
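
The task examples above differ mainly in a handful of flags, so the command line can also be assembled programmatically. The sketch below is a hypothetical helper (`build_lance_command` is not part of the repository) that picks the model path and resolution preset per task. It assumes the flag names and the `downloads/` paths used in the examples above, and it only builds the argument list without running anything.

```python
import shlex

# Task names and flags are taken from the examples in this README.
# The helper itself is hypothetical and not part of the Lance repository.
SUPPORTED_TASKS = {"t2i", "t2v", "i2v", "image_edit", "video_edit", "x2t_image", "x2t_video"}
VIDEO_TASKS = {"t2v", "i2v", "video_edit", "x2t_video"}


def build_lance_command(task, save_path=None, **overrides):
    """Return the argv list for `bash inference_lance.sh` for the given task."""
    if task not in SUPPORTED_TASKS:
        raise ValueError(f"unsupported task: {task!r}")
    is_video = task in VIDEO_TASKS
    params = {
        "TASK_NAME": task,
        "MODEL_PATH": "downloads/Lance_3B_Video" if is_video else "downloads/Lance_3B",
        "RESOLUTION": "video_480p" if is_video else "image_768res",
        "SAVE_PATH_GEN": save_path or f"results/{task}",
    }
    # Keyword overrides such as num_frames=121 become --NUM_FRAMES 121.
    params.update({key.upper(): value for key, value in overrides.items()})
    argv = ["bash", "inference_lance.sh"]
    for name, value in params.items():
        argv += [f"--{name}", str(value)]
    return argv


cmd = build_lance_command("t2v", num_frames=121, video_height=480, video_width=848)
print(shlex.join(cmd))
```

To launch the run, the list can be passed to `subprocess.run(cmd, check=True)` from the repository root.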


#### Available Tasks

| Task Name | Description | Example JSON |
| --- | --- | --- |
| `t2v` | Text-to-Video generation | `config/examples/t2v_example.json` |
| `t2i` | Text-to-Image generation | `config/examples/t2i_example.json` |
| `i2v` | Image-to-Video generation | `config/examples/i2v_example.json` |
| `image_edit` | Image editing | `config/examples/image_edit_example.json` |
| `video_edit` | Video editing | `config/examples/video_edit_example.json` |
| `x2t_image` | Image understanding | `config/examples/x2t_image_example.json` |
| `x2t_video` | Video understanding | `config/examples/x2t_video_example.json` |

For understanding examples:

- `config/examples/x2t_image_example.json`: image understanding examples for visual question answering, reasoning and image captioning.
- `config/examples/x2t_video_example.json`: video understanding examples for video question answering and video captioning.

#### Parameters

You can configure the following hyperparameters at the top of the `inference_lance.sh` script:

| Parameter | Default Value | Description |
| --- | --- | --- |
| `MODEL_PATH` | `"downloads/Lance_3B"` | Path to the downloaded Lance model weights (`Lance_3B` or `Lance_3B_Video`). |
| `NUM_GPUS` | `1` | Number of GPUs to use for inference. |
| `VALIDATION_NUM_TIMESTEPS` | `30` | Number of denoising steps (e.g., 30 or 50). |
| `VALIDATION_TIMESTEP_SHIFT` | `3.5` | Timestep shift parameter for flow matching scheduling. |
| `CFG_TEXT_SCALE` | `4.0` | Classifier-Free Guidance (CFG) scale for text conditioning. |
| `VALIDATION_DATA_SEED` | `42` | Random seed for generation reproducibility. |
| `NUM_FRAMES` | `50` | Number of frames for video generation (Max: 121). *Unused for image tasks.* |
| `VIDEO_HEIGHT` / `VIDEO_WIDTH` | `768` | Spatial resolution. *Unused for editing tasks (determined by input image/video).* |
| `RESOLUTION` | `"video_480p"` | Base resolution preset (`image_768res` or `video_480p`). |
| `CONFIG_PATH` | `""` | Optional path to a custom validation JSON/JSONL file. When empty, the task default example config is used. |
| `ENHANCE_PROMPT` | `false` | Optional T2V/I2V prompt rewrite switch. T2V uses text-only rewrite; I2V uses text plus the input image. Prompt enhancement generally improves generation quality. Configure the rewrite API key and client in `common/utils/caption_rewrite.py` before setting this to `true`; without a key, we recommend writing prompts in the style of the provided examples. |
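
To give a feel for what `VALIDATION_NUM_TIMESTEPS`, `VALIDATION_TIMESTEP_SHIFT`, and `NUM_FRAMES` control, here is a minimal sketch. It assumes the warp `t' = shift * t / (1 + (shift - 1) * t)`, which is a common choice in flow-matching samplers; whether Lance's scheduler uses exactly this formula is an assumption, and both function names are hypothetical. The clip-length helper assumes playback at the 12 FPS mentioned in the note at the top of this README.

```python
def shifted_timesteps(num_steps, shift):
    """Uniform timesteps from 1.0 down to 1/num_steps, warped by the shift.

    Assumed warp (not confirmed for Lance): t' = shift * t / (1 + (shift - 1) * t).
    A shift above 1 spends more of the schedule at high noise levels.
    """
    uniform = [1.0 - i / num_steps for i in range(num_steps)]
    return [shift * t / (1.0 + (shift - 1.0) * t) for t in uniform]


def clip_seconds(num_frames, fps=12):
    """Approximate clip duration, assuming the 12 FPS training frame rate."""
    return num_frames / fps


steps = shifted_timesteps(num_steps=30, shift=3.5)
print(f"first={steps[0]:.3f}, last={steps[-1]:.3f}")
print(f"121 frames at 12 FPS is about {clip_seconds(121):.1f} s")
```

Under this assumed warp and the default shift of 3.5, a uniform timestep of 0.5 maps to roughly 0.78, so more denoising steps land in the noisier part of the trajectory.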

### 🖥️ Gradio

You can launch the local Gradio demo for video/image generation, editing, and understanding:

```bash
python lance_gradio.py --server-name 0.0.0.0 --server-port 7860
```

### Benchmarks

**DPG-Bench Evaluation**

| Models | # Params. | Global | Entity | Attribute | Relation | Other | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| *Generation-only Models* | | | | | | | |
| SDXL | 3.5B | 83.27 | 82.43 | 80.91 | 86.76 | 80.41 | 74.65 |
| DALL-E 3 | - | 90.97 | 89.61 | 88.39 | 90.58 | 89.83 | 83.50 |
| SD3-Medium | 2B | 87.90 | 91.01 | 88.83 | 80.70 | 88.68 | 84.08 |
| FLUX.1-dev | 12B | 74.35 | 90.00 | 88.96 | 90.87 | 88.33 | 83.84 |
| Qwen-Image | 20B | 91.32 | 91.56 | 92.02 | 94.31 | 92.73 | 88.32 |
| *Unified Models* | | | | | | | |
| Janus-Pro-7B | 7B | 86.90 | 88.90 | 89.40 | 89.32 | 89.48 | 84.19 |
| OmniGen2 | 4B | 88.81 | 88.83 | 90.18 | 89.37 | 90.27 | 83.57 |
| Show-o2 | 7B | 89.00 | 91.78 | 89.96 | 91.81 | 91.64 | 86.14 |
| BAGEL<sup>†</sup> | 7B | 88.94 | 90.37 | 91.29 | 90.82 | 88.67 | 85.07 |
| InternVL-U | 1.7B | 90.39 | 90.78 | 90.68 | 90.29 | 88.77 | 85.18 |
| TUNA | 7B | 90.42 | 91.68 | 90.94 | 91.87 | 90.73 | 86.76 |
| TUNA-2 | 7B | 89.50 | 91.40 | 92.07 | 91.91 | 88.81 | 86.54 |
| 🌟 Lance (Ours) | 3B | 83.89 | 91.07 | 89.36 | 93.38 | 80.80 | 84.67 |

<sup>†</sup> indicates methods that use LLM rewriters for prompt rewriting before generation.

**GenEval Evaluation**

| Models | # Params. | 1-Obj. | 2-Obj. | Count | Colors | Position | Attr. | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Generation-only Models* | | | | | | | | |
| SDXL | 3.5B | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 |
| DALL-E 3 | - | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 | 0.67 |
| SD3-Medium | 2B | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 | 0.74 |
| FLUX.1-dev | 12B | 0.98 | 0.93 | 0.75 | 0.93 | 0.68 | 0.65 | 0.82 |
| Qwen-Image | 20B | 0.99 | 0.92 | 0.89 | 0.88 | 0.76 | 0.77 | 0.87 |
| *Unified Models* | | | | | | | | |
| Janus-Pro-7B | 7B | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 | 0.80 |
| OmniGen2 | 4B | 1.00 | 0.95 | 0.64 | 0.88 | 0.55 | 0.76 | 0.80 |
| Show-o2 | 7B | 1.00 | 0.87 | 0.58 | 0.92 | 0.52 | 0.62 | 0.76 |
| BAGEL<sup>†</sup> | 7B | 0.98 | 0.95 | 0.84 | 0.95 | 0.78 | 0.77 | 0.88 |
| Mogao | 7B | 1.00 | 0.97 | 0.83 | 0.93 | 0.84 | 0.80 | 0.89 |
| InternVL-U | 1.7B | 0.99 | 0.94 | 0.74 | 0.91 | 0.77 | 0.74 | 0.85 |
| TUNA | 7B | 1.00 | 0.97 | 0.81 | 0.91 | 0.88 | 0.83 | 0.90 |
| TUNA-2 | 7B | 0.99 | 0.96 | 0.80 | 0.91 | 0.84 | 0.76 | 0.87 |
| 🌟 Lance (Ours) | 3B | 1.00 | 0.94 | 0.84 | 0.97 | 0.87 | 0.81 | 0.90 |

<sup>†</sup> indicates methods that use LLM rewriters for prompt rewriting before generation.

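**GEdit-Bench Evaluation**

| Models | # Params. | BC | CA | MM | MC | PB | ST | SA | SR | SRp | TM | TT | Avg/G_O |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Generation-only Models* | | | | | | | | | | | | | |
| Gemini 2.0 | - | - | - | - | - | - | - | - | - | - | - | - | 6.32 |
| GPT Image 1 | - | 6.96 | 6.85 | 7.10 | 5.41 | 6.74 | 7.44 | 7.51 | 8.73 | 8.55 | 8.45 | 8.69 | 7.49 |
| Qwen-Image-Edit | 20B | 8.23 | 8.30 | 7.33 | 8.05 | 7.49 | 6.74 | 8.57 | 8.09 | 8.29 | 8.48 | 8.50 | 8.01 |
| *Unified Models* | | | | | | | | | | | | | |
| Lumina-DiMOO | 8B | 3.43 | 4.27 | 3.08 | 2.77 | 4.74 | 5.19 | 4.44 | 3.80 | 4.38 | 2.68 | 4.20 | 3.91 |
| Ovis-U1 | 1.2B | 7.49 | 6.88 | 6.21 | 4.79 | 5.98 | 6.46 | 7.49 | 7.25 | 7.27 | 4.48 | 6.31 | 6.42 |
| BAGEL | 7B | 7.32 | 6.91 | 6.38 | 4.75 | 4.57 | 6.15 | 7.90 | 7.16 | 7.02 | 7.32 | 6.22 | 6.52 |
| InternVL-U | 1.7B | 7.08 | 7.05 | 6.38 | 7.02 | 6.03 | 6.27 | 7.13 | 6.55 | 6.33 | 6.59 | 6.85 | 6.66 |
| InternVL-U (w/ CoT) | 1.7B | 7.05 | 7.87 | 6.50 | 6.99 | 5.77 | 6.10 | 7.33 | 7.16 | 7.12 | 7.36 | 6.46 | 6.88 |
| 🌟 Lance (Ours) | 3B | 7.73 | 7.74 | 7.28 | 7.83 | 7.50 | 7.03 | 7.64 | 7.85 | 7.71 | 4.46 | 7.57 | 7.30 |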

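**VBench Evaluation (Video Generation)**

| Type | Model | # Params. | Total Score ↑ |
| --- | --- | --- | --- |
| Gen. Only | ModelScope | 1.7B | 75.75 |
| | LaVie | 3B | 77.08 |
| | Show-1 | 6B | 78.93 |
| | AnimateDiff-V2 | - | 80.27 |
| | VideoCrafter-2.0 | - | 80.44 |
| | CogVideoX | 5B | 81.61 |
| | Kling | - | 81.85 |
| | Open-Sora-2.0 | - | 81.71 |
| | Gen-3 | - | 82.32 |
| | Step-Video-T2V | 30B | 81.83 |
| | Hunyuan Video | - | 83.43 |
| | Wan2.1-T2V | 14B | 83.69 |
| Unified | HaproOmni | 7B | 78.10 |
| | Emu3 | 8B | 80.96 |
| | VILA-U | 7B | 74.01 |
| | Show-o2 | 2B | 81.34 |
| | TUNA | 1.5B | 84.06 |
| | 🌟 Lance (Ours) | 3B | 85.11 |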

#### Running Benchmarks

Ready-to-run benchmark scripts are provided under `benchmarks/`:

| Benchmark | Modality | Script |
| --- | --- | --- |
| GenEval (image gen) | Image | `benchmarks/image_gen/GenEVAL/sample_GenEVAL.sh` |
| DPG (image gen) | Image | `benchmarks/image_gen/DPG/sample_DPG.sh` |
| GEdit (image edit) | Image | `benchmarks/image_gen/GEdit/sample_GEdit.sh` |
| VBench (video gen) | Video | `benchmarks/video_gen/Vbench/sample_vbench.sh` |

## 📄 License

Copyright 2025 ByteDance Ltd. and/or its affiliates.

## 🙏 Acknowledgements

We would like to thank the contributors of [BAGEL](https://github.com/ByteDance-Seed/bagel), [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct), and [Wan2.2](https://github.com/Wan-Video/Wan2.2) for their open research and contributions.

## 💖 Citation

If you find **Lance** useful for your project or research, please consider starring 🌟 this repo and citing our work using the following BibTeX:

```bibtex
@misc{fu2026lanceunifiedmultimodalmodeling,
  title         = {Lance: Unified Multimodal Modeling by Multi-Task Synergy},
  author        = {Fengyi Fu and Mengqi Huang and Shaojin Wu and Yunsheng Jiang and Yufei Huo and Hao Li and Yinghang Song and Fei Ding and Jianzhu Guo and Qian He and Zheren Fu and Zhendong Mao and Yongdong Zhang},
  year          = {2026},
  eprint        = {2605.18678},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2605.18678},
}
```

## 📞 Contact

For questions, issues, or collaborations, please contact [Mengqi Huang](https://corleone-huang.github.io/) and [Jianzhu Guo](https://guojianzhu.com/).