# L2P: Unlocking Latent Potential for Pixel Generation


An efficient transfer paradigm enabling high-quality, end-to-end pixel-space diffusion with minimal computational overhead and data requirements.

⭐ If L2P helps your research or product, please consider giving the repo a star ⭐
---

## 📰 News

- **\[2026.05.12\]** Technical report released.
- **\[2026.05.22\]** 1K-resolution training code, inference code, weights, and dataset released.
- **\[2026.05.23\]** Online [demo](https://huggingface.co/spaces/multimodalart/z-image-6b-pixel-space) available. (Thanks to [multimodalart](https://huggingface.co/multimodalart) for the support!)

---

## 🗺️ Roadmap

| Status | Item |
| :---: | :--- |
| ✅ | 1K inference code & weights |
| ✅ | Training code |
| 🛠️ | 4K/8K/10K UHR generation |
| 🛠️ | Compatibility with more LDM models |

---

## 📦 Installation

```bash
git clone https://github.com/TencentYoutuResearch/T2I-L2P.git
cd T2I-L2P
pip install -e .
```

---

## 🎨 Inference

Checkpoint:

| Model | Params | HuggingFace |
|-------|--------|-------------|
| L2P-z-image (1K resolution) | 6B | [🤗](https://huggingface.co/zhen-nan/L2P) |

```python
import torch
from diffsynth.pipelines.z_image_L2P import ZImagePipeline, ModelConfig

# Local paths to the L2P weights, the Z-Image-Turbo text encoder shards, and the tokenizer.
main_model_path = "/path/model-1k-merge.safetensors"
text_encoder_paths = [
    "/path/Z-Image-Turbo/text_encoder/model-00001-of-00003.safetensors",
    "/path/Z-Image-Turbo/text_encoder/model-00002-of-00003.safetensors",
    "/path/Z-Image-Turbo/text_encoder/model-00003-of-00003.safetensors",
]
tokenizer_path = "/path/Z-Image-Turbo/tokenizer"

# Build the pipeline in bfloat16 on the GPU.
pipe = ZImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(path=[main_model_path]),
        ModelConfig(path=text_encoder_paths),
    ],
    tokenizer_config=ModelConfig(path=tokenizer_path),
)

# Generate a 1024x1024 image with a fixed seed and save it to disk.
prompt = "an origami pig on fire in the middle of a dark room with a pentagram on the floor"
image = pipe(
    prompt=prompt,
    seed=42,
    rand_device="cuda",
    num_inference_steps=30,
    cfg_scale=2.0,
    height=1024,
    width=1024,
)
image.save("example.png")
```

### Gradio Demo

First, install gradio:

```bash
pip install gradio
```

Launch a multi-GPU web UI:

```bash
python app.py
```

The demo auto-detects free GPUs, dispatches each request to an idle device, and exposes a Gradio interface at
`http://0.0.0.0:23231`.

---

## 🏋️ Training

The full training pipeline consists of four steps: **(1)** prepare the Z-Image base weights → **(2)** convert them into a pixel-space initialization → **(3)** launch training → **(4)** merge the trained delta back with the pixel-init weights for inference.

### Step 1 · Prepare Z-Image weights

Download the official **Z-Image-Turbo** checkpoint from Hugging Face:

- 🤗 [Tongyi-MAI/Z-Image-Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo)

### Step 2 · Offline weight conversion (latent → pixel init)

Convert the latent-space DiT weights into a **pixel-space initialization** that L2P can fine-tune from:

```bash
python examples/z_image/L2P_convert_weight.py \
    --latent_ckpt_files \
        /path/to/Z-Image-Turbo/transformer/diffusion_pytorch_model-00001-of-00003.safetensors \
        /path/to/Z-Image-Turbo/transformer/diffusion_pytorch_model-00002-of-00003.safetensors \
        /path/to/Z-Image-Turbo/transformer/diffusion_pytorch_model-00003-of-00003.safetensors \
    --output_path ./pretrain_weight/Z-Image-Pixel-Init/diffusion_pytorch_model.safetensors
```

### Step 3 · Launch training

**Standard training**:

```bash
bash train_run.sh
```

**Low-VRAM training** (single GPU < 24 GB VRAM):

```bash
bash train_run_low_VRAM.sh
```

#### Dataset format

Provide a directory of images plus a CSV metadata file:

```
data/
├── images/          # raw image folder
└── metadata.csv     # columns: file_name, text, ...
```

### Step 4 · Offline weight merge (for inference)

```bash
python merge_weights.py \
    --file_a ./models/train/L2P_Standard/step-xxx.safetensors \
    --file_b ./pretrain_weight/Z-Image-Pixel-Init/diffusion_pytorch_model.safetensors \
    --file_out ./models/train/L2P_Standard/model-merge.safetensors
```

- `--file_a`: trained checkpoint from Step 3
- `--file_b`: pixel-init weights from Step 2
- `--file_out`: merged single-file weight

---

## 📜 Citation

If you find this work useful, please consider citing:

```bibtex
@article{chen2026l2p,
  title   = {L2P: Unlocking Latent Potential for Pixel Generation},
  author  = {Chen, Zhennan and Zhu, Junwei and Chen, Xu and Zhang, Jiangning and Chen, Jiawei and Zeng, Zhuoqi and Zhang, Wei and Wang, Chengjie and Yang, Jian and Tai, Ying},
  journal = {arXiv preprint arXiv:2605.12013},
  year    = {2026}
}

@article{chen2025dip,
  title   = {DiP: Taming Diffusion Models in Pixel Space},
  author  = {Chen, Zhennan and Zhu, Junwei and Chen, Xu and Zhang, Jiangning and Hu, Xiaobin and Zhao, Hanzhen and Wang, Chengjie and Yang, Jian and Tai, Ying},
  journal = {arXiv preprint arXiv:2511.18822},
  year    = {2025}
}
```

---

## 🙏 Acknowledgements

L2P is built upon the excellent open-source work of [**DiffSynth-Studio**](https://github.com/modelscope/DiffSynth-Studio) and [**Z-Image**](https://github.com/Tongyi-MAI/Z-Image).
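
---

## 📝 Appendix: building `metadata.csv`

The dataset format in Step 3 of Training only states that `metadata.csv` has `file_name` and `text` columns. The following is a minimal sketch of one way to generate such a file, not the repository's own tooling. It rests on several assumptions that are not from this README: captions live in a `.txt` file next to each image with the same stem, `file_name` is written relative to `data/` (for example `images/cat.png`), and the helper name `build_metadata` and the suffix list are hypothetical. Check what the training scripts expect before relying on it.

```python
import csv
from pathlib import Path

# Assumed set of image extensions; adjust to match your data.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def build_metadata(data_dir: str) -> int:
    """Write <data_dir>/metadata.csv and return the number of rows written.

    Hypothetical helper: pairs every image in <data_dir>/images with a
    same-stem .txt caption file (an assumption, not an L2P convention).
    """
    root = Path(data_dir)
    rows = []
    for image_path in sorted((root / "images").iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        caption_path = image_path.with_suffix(".txt")
        if not caption_path.exists():
            continue  # skip images that have no caption file
        # Collapse newlines and repeated spaces so each caption is one line.
        text = " ".join(caption_path.read_text(encoding="utf-8").split())
        rows.append(
            {
                # Assumption: paths are stored relative to the data directory.
                "file_name": image_path.relative_to(root).as_posix(),
                "text": text,
            }
        )

    with open(root / "metadata.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["file_name", "text"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Calling `build_metadata("data")` would write `data/metadata.csv` for the layout shown in the dataset format section. If the training scripts expect bare file names rather than paths relative to `data/`, replace the `file_name` value with `image_path.name`.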