# PiD — Pixel Diffusion Decoder
> **TL;DR** — PiD is a plug-and-play diffusion decoder that replaces VAE/RAE decoders, turning latent representations directly into super-resolved pixels in a single pass.
https://github.com/user-attachments/assets/a556e2d4-5de5-4bcf-9daa-80f7ea6b2124
PiD reformulates the latent-to-pixel decoder as a conditional pixel-space diffusion
model, unifying decoding and upsampling into a single generative module.
It directly denoises in high-resolution pixel
space and produces a super-resolved image in one pass.
**[Paper](https://arxiv.org/abs/2605.23902), [Project Page](https://research.nvidia.com/labs/sil/projects/pid/), [Model Weights](https://huggingface.co/nvidia/PiD)**
[Yifan Lu](https://yifanlu0227.github.io/),
[Qi Wu](https://wilsoncernwq.github.io/),
[Jay Zhangjie Wu](https://zhangjiewu.github.io/),
[Zian Wang](https://www.cs.toronto.edu/~zianwang/),
[Huan Ling](https://www.cs.toronto.edu/~linghuan/),
[Sanja Fidler](https://www.cs.utoronto.ca/~fidler/),
[Xuanchi Ren](https://xuanchiren.com/)
## News
- 🔥 [June 2, 2026] PiD checkpoints for **SDXL**, **Qwen-Image** and **Qwen-Image-2512** are released. Check [HuggingFace](https://huggingface.co/nvidia/PiD).
- 🔥 [June 2, 2026] A new checkpoint for **FLUX.2 (2kto4k)** (with `_2606` suffix) that has no color drifting issue. See [here](docs/FLUX2_2kto4k_new_ckpt_compare.md) for comparison with the old one.
- 🔥 [June 2, 2026] We cleaned up the codebase and removed unused code. `torch.compile` mode is now available as well.
- 🚀 [May 27, 2026] PiD is now in [ComfyUI](https://github.com/Comfy-Org/ComfyUI/pull/14103)!
- 🚀 [May 25, 2026] Paper, code, and model weights released, with PiD options for **FLUX**, **FLUX.2**, **Z-Image**, **Z-Image-Turbo**, **SD3**, **DINOv2**, and **SigLIP**.
- 🔜 [Coming Soon] PiD undistilled checkpoints.
- ⏳ [Planned] Training scripts.
## Installation
> [!TIP]
> **Quick Start** — if your environment already has PyTorch (with CUDA), `transformers>=4.57.x`, and `diffusers>=0.37`, you don't need to build a new conda env. Just install the small set of utility deps the inference code pulls eagerly and you're ready to run the diffusers backbones (`flux`/`flux2`/`flux2-klein-4b`/`flux2-klein-9b`/`sd3`/`zimage`/`zimage-turbo`):
>
> ```bash
> pip install hydra-core omegaconf pyyaml \
> attrs einops loguru termcolor fvcore iopath wandb \
> imageio opencv-python-headless pandas \
> safetensors sentencepiece boto3 botocore
> pip install -e .
> ```
> To validate that your environment is ready for inference, run `python verify_env.py`.
Full conda-managed install (preferred if you're starting from scratch):
```bash
# 1. Create and activate the conda environment.
conda env create -f environment.yml
conda activate pid
# 2. Install this package in editable mode.
pip install -e .
```
### Download Checkpoints
Checkpoints are hosted at [`nvidia/PiD`](https://huggingface.co/nvidia/PiD) on HuggingFace.
Pull the `checkpoints/` folder into this repo:
```bash
hf download nvidia/PiD --local-dir . --include "checkpoints/*"
```
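If you would rather script the download from Python, the same filter can be expressed with `huggingface_hub.snapshot_download`. This is a minimal sketch, assuming the `huggingface_hub` package is installed; the helper names `checkpoint_download_kwargs` and `download_pid_checkpoints` are illustrative and not part of this repo.

```python
def checkpoint_download_kwargs(local_dir="."):
    """Arguments mirroring: hf download nvidia/PiD --local-dir . --include "checkpoints/*"."""
    return {
        "repo_id": "nvidia/PiD",
        "local_dir": local_dir,
        # Only fetch files under checkpoints/, like the --include filter.
        "allow_patterns": ["checkpoints/*"],
    }


def download_pid_checkpoints(local_dir="."):
    """Download only the checkpoints/ folder and return the local path."""
    # Imported lazily so this file can be imported without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(**checkpoint_download_kwargs(local_dir))
```

Calling `download_pid_checkpoints(".")` from the repo root should leave the files under `./checkpoints/`, matching the CLI command above.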
## Running inference
PiD ships two complementary entry points, each selecting a backbone with `--backbone`:
- `from_ldm.py` — text/class → latent diffusion → PiD decode
- `from_clean.py` — image → VAE encode → PiD decode
> [!IMPORTANT]
> Picking the checkpoint variant — `--pid_ckpt_type`
> Every entry point accepts `--pid_ckpt_type {2k,2kto4k}` (default `2k`):
>
> - **`2k`**: the original decoder, trained at 2K resolution only. It supports multiple aspect ratios, typically 2048 × 2048 (1:1), 2304 × 1728 (4:3), 1728 × 2304 (3:4), 2688 × 1536 (16:9), and 1536 × 2688 (9:16).
> - **`2kto4k`**: the up-to-4K decoder, trained at varying resolutions from 2K to 4K. It also supports multiple aspect ratios, but performs worse than `2k` at 2048px.
| `--backbone` | Currently available `--pid_ckpt_type` |
|----------------|:-------------------------------------:|
| flux | `2k`, `2kto4k` |
| flux2 | `2k`, `2kto4k` |
| flux2-klein-4b | `2k`, `2kto4k` |
| flux2-klein-9b | `2k`, `2kto4k` |
| sd3 | `2k`, `2kto4k` |
| zimage | `2k`, `2kto4k` |
| zimage-turbo | `2k`, `2kto4k` |
| sdxl | `2kto4k` |
| qwenimage | `2kto4k` |
| qwenimage-2512 | `2kto4k` |
| dinov2 (RAE) | `2k` |
| siglip (Scale-RAE) | `2k` |
For the exact checkpoint path behind each `(backbone, --pid_ckpt_type)`, see [docs/checkpoints.md](docs/checkpoints.md).
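The table above can be turned into a small pre-flight check before launching a long job. This is a sketch only: `AVAILABLE_CKPT_TYPES` and `resolve_ckpt_type` are hypothetical names, and the behavior of raising when the default `2k` is unavailable is an assumption, since the CLI's own handling of that case is not described here.

```python
from typing import Optional

# Availability copied from the table above; update it when new checkpoints ship.
AVAILABLE_CKPT_TYPES = {
    "flux": ("2k", "2kto4k"),
    "flux2": ("2k", "2kto4k"),
    "flux2-klein-4b": ("2k", "2kto4k"),
    "flux2-klein-9b": ("2k", "2kto4k"),
    "sd3": ("2k", "2kto4k"),
    "zimage": ("2k", "2kto4k"),
    "zimage-turbo": ("2k", "2kto4k"),
    "sdxl": ("2kto4k",),
    "qwenimage": ("2kto4k",),
    "qwenimage-2512": ("2kto4k",),
    "dinov2": ("2k",),
    "siglip": ("2k",),
}
DEFAULT_CKPT_TYPE = "2k"  # documented default of --pid_ckpt_type


def resolve_ckpt_type(backbone: str, requested: Optional[str] = None) -> str:
    """Return a valid --pid_ckpt_type for `backbone`, or raise ValueError."""
    if backbone not in AVAILABLE_CKPT_TYPES:
        raise ValueError(f"unknown backbone: {backbone!r}")
    available = AVAILABLE_CKPT_TYPES[backbone]
    wanted = requested if requested is not None else DEFAULT_CKPT_TYPE
    if wanted not in available:
        raise ValueError(
            f"{backbone!r} has no {wanted!r} checkpoint; choose one of {available}"
        )
    return wanted


print(resolve_ckpt_type("flux"))            # 2k
print(resolve_ckpt_type("sdxl", "2kto4k"))  # 2kto4k
```

Under this sketch, `resolve_ckpt_type("sdxl")` raises, which is a reminder to pass `--pid_ckpt_type 2kto4k` explicitly for backbones that only list `2kto4k`.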
### 📕 `from_ldm`: text / class → latent diffusion → PiD decode
Runs the chosen `--backbone` on a prompt, captures the intermediate `x_t` at user-specified denoising steps (early LDM
termination) and the final clean `x_0`, then decodes each captured latent with both the
native VAE / RAE decoder (baseline) and PiD.
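The capture step can be pictured with the `callback_on_step_end` hook that diffusers pipelines expose. The sketch below is not the repo's `XtCaptureCallback`: `LatentCapture` is a hypothetical stand-in, and it assumes that step `k` means the latent after `k` denoising steps (1-based), which may differ from the convention used in `step_capture.py`.

```python
import copy


class LatentCapture:
    """Hypothetical callback that stores latents at selected denoising steps."""

    def __init__(self, save_steps):
        self.save_steps = set(save_steps)
        self.captured = {}  # step number (1-based) -> latent snapshot

    def __call__(self, pipe, step_index, timestep, callback_kwargs):
        # diffusers passes a 0-based step_index; convert to "steps completed".
        step = step_index + 1
        if step in self.save_steps:
            latents = callback_kwargs["latents"]
            # Snapshot the value so later denoising steps cannot mutate it.
            if hasattr(latents, "detach"):
                self.captured[step] = latents.detach().clone()
            else:
                self.captured[step] = copy.deepcopy(latents)
        # The hook must return the kwargs dict so the pipeline can continue.
        return callback_kwargs
```

With a diffusers pipeline, an instance would be passed as `callback_on_step_end=capture` together with `callback_on_step_end_tensor_inputs=["latents"]`; afterwards `capture.captured` holds one latent per requested step.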
#### Example 1 — Single-GPU, single prompt (Flux, default `2k` decoder)
Generates a 2048px image with Flux and PiD decode, decoding the latents captured at step 24 and at the full 28 LDM steps.
```bash
PYTHONPATH=. python -m pid._src.inference.from_ldm --backbone flux \
--prompt "A photorealistic half-body portrait of a brown tabby cat with bold stripes sitting attentively on a rustic wooden kitchen table, soft morning light streaming sideways through a large window, fine fur detail and stripe patterns sharply visible, intense amber-green eyes in razor-sharp focus, warm farmhouse kitchen softly out of focus, cinematic shallow depth of field, ultra-detailed fur texture, photorealistic" \
--ldm_inference_steps 28 --save_xt_steps 24 \
--output_dir ./results/official_demo/flux \
--pid_inference_steps 4
```
#### Example 2 — Single-GPU, 4K decode with 4:3 aspect ratio (Flux, `2kto4k` decoder)
Same backbone as Example 1 but with `--resolution 4096,3072 --pid_ckpt_type 2kto4k`.
`--resolution` is the final output size, so the LDM runs at `1024,768` and
PiD decodes it to 4K.
```bash
PYTHONPATH=. python -m pid._src.inference.from_ldm --backbone flux \
--prompt "A close photograph of a cat looking through frosted glass beside a small pine branch, winter light, soft condensation, simple cozy composition, expressive eyes." \
--resolution 4096,3072 --pid_ckpt_type 2kto4k \
--ldm_inference_steps 28 --save_xt_steps 24 26 \
--output_dir ./results/official_demo/flux_4k_ar4_3
```
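The relationship between `--resolution` and the LDM resolution in the example above can be sketched as follows. The factor of 4 is inferred from the `4096,3072` to `1024,768` example and is an assumption for other backbones; `parse_resolution` and `ldm_resolution` are illustrative helpers, not functions from this repo.

```python
def parse_resolution(text):
    """Parse '4096,3072' or '4096' into a pair of side lengths."""
    parts = [int(p) for p in text.split(",")]
    if len(parts) == 1:
        parts = parts * 2  # a single value means a square output
    if len(parts) != 2:
        raise ValueError(f"expected one or two integers, got {text!r}")
    return parts[0], parts[1]


def ldm_resolution(output_res, upscale=4):
    """Resolution the LDM runs at, assuming PiD upsamples by `upscale`."""
    first, second = output_res
    if first % upscale or second % upscale:
        raise ValueError("output sides must be divisible by the upscale factor")
    return first // upscale, second // upscale


print(ldm_resolution(parse_resolution("4096,3072")))  # (1024, 768)
print(ldm_resolution(parse_resolution("4096")))       # (1024, 1024)
```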
#### Example 3 — Multi-GPU with a prompt file (Z-Image) with torch.compile
`torchrun` shards `--prompt_file` across ranks; each rank writes to
`--output_dir` independently. `--compile` enables `torch.compile` for faster inference;
the first call is slow because of compilation. The `default` compilation mode is used. For further speedup, switch to `max-autotune` in `_maybe_compile_net` (`pid/_src/models/pixeldit_model.py:210`).
```bash
PYTHONPATH=. torchrun --nproc_per_node=4 \
-m pid._src.inference.from_ldm --backbone zimage \
--prompt_file pid/_src/inference/prompts/prompt_creative.txt \
--ldm_inference_steps 50 --save_xt_steps 46 \
--compile \
--output_dir ./results/official_demo/zimage
```
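Sharding a prompt list across ranks, as in the example above, can be illustrated with round-robin striding. The exact split used by `from_ldm.py` is not documented here, so treat the strategy and the `shard_prompts` helper as assumptions; the `RANK` and `WORLD_SIZE` environment variables are the ones `torchrun` sets for each worker.

```python
import os


def shard_prompts(prompts, rank, world_size):
    """Return the prompts handled by `rank` using round-robin striding."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return prompts[rank::world_size]


# torchrun exports RANK and WORLD_SIZE for every worker process;
# the fallbacks make the script usable in a plain single-process run.
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

prompts = [f"prompt {i}" for i in range(10)]
print(rank, shard_prompts(prompts, rank, world_size))
```

Striding keeps the per-rank workload within one prompt of each other, and every prompt is assigned to exactly one rank.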
#### Example 4 — Multi-GPU, 4K decode (Z-Image-Turbo, `2kto4k` decoder)
Z-Image-Turbo defaults to 9 diffusers steps with `guidance_scale=0.0`. The final
clean latent `x0` is always saved and is the recommended Turbo output to inspect.
`--save_xt_steps 7` is optional; it saves an additional near-final `x_t` sample
for comparison. `--resolution 4096` means `H=4096, W=4096` and the LDM runs at `1024,1024`.
```bash
PYTHONPATH=. torchrun --nproc_per_node=4 \
-m pid._src.inference.from_ldm --backbone zimage-turbo \
--prompt_file pid/_src/inference/prompts/prompt_zimage_turbo.txt \
--resolution 4096 --pid_ckpt_type 2kto4k \
--output_dir ./results/official_demo/zimage_turbo_4k
```
#### `dinov2` / `siglip` backbones
The upstream RAE / Scale-RAE LDMs don't live in `diffusers` — see
[`docs/dinov2_siglip.md`](docs/dinov2_siglip.md) for setup and end-to-end
examples.
#### Suggested step settings per diffusers backbone
(See each script's docstring for the exact recipe.)
| Backbone | LDM steps flag | Default steps | Optional `--save_xt_steps` | Recommended latent |
|----------|-------------------------|---------------|----------------------------|--------------------|
| flux | `--ldm_inference_steps` | 28 | `22 24 26` | step `24` |
| sd3 | `--ldm_inference_steps` | 28 | `22 24 26` | step `24` |
| sdxl | `--ldm_inference_steps` | 30 | `24 26 28` | step `26` |
| flux2 | `--ldm_inference_steps` | 50 | `44 46 48` | step `46` |
| flux2-klein-4b | `--ldm_inference_steps` | 4 | `2 3` | `x0` |
| flux2-klein-9b | `--ldm_inference_steps` | 4 | `2 3` | `x0` |
| qwenimage | `--ldm_inference_steps` | 50 | `44 46 48` | step `44` |
| qwenimage-2512 | `--ldm_inference_steps` | 50 | `44 46 48` | step `44` |
| zimage | `--ldm_inference_steps` | 50 | `44 46 48` | step `46` |
| zimage-turbo | `--ldm_inference_steps` | 9 | `7` | `x0` |
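For scripted runs, the recommendations in the table above can be encoded as data and turned into a command line. This is a sketch: `STEP_RECIPES` and `from_ldm_args` are hypothetical helpers, and only flags that appear in the examples above are used.

```python
import shlex

# (default LDM steps, recommended --save_xt_steps) copied from the table above.
# None means the final clean latent x0 is the recommended output.
STEP_RECIPES = {
    "flux": (28, 24),
    "sd3": (28, 24),
    "sdxl": (30, 26),
    "flux2": (50, 46),
    "flux2-klein-4b": (4, None),
    "flux2-klein-9b": (4, None),
    "qwenimage": (50, 44),
    "qwenimage-2512": (50, 44),
    "zimage": (50, 46),
    "zimage-turbo": (9, None),
}


def from_ldm_args(backbone, prompt, output_dir):
    """Build an argv list for pid._src.inference.from_ldm from the table's recipe."""
    steps, save_step = STEP_RECIPES[backbone]
    args = [
        "python", "-m", "pid._src.inference.from_ldm",
        "--backbone", backbone,
        "--prompt", prompt,
        "--ldm_inference_steps", str(steps),
        "--output_dir", output_dir,
    ]
    if save_step is not None:
        # Capture the recommended intermediate latent in addition to x0.
        args += ["--save_xt_steps", str(save_step)]
    return args


print(shlex.join(from_ldm_args("flux2", "a red fox", "./results/flux2")))
```

The list can be passed to `subprocess.run` with `PYTHONPATH=.` set in the environment, as in the shell examples.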
---
### 📗 `from_clean`: image → VAE encode → PiD decode
No latent diffusion model is run. The input image is fed at its native resolution
(only center-cropped so each side is a multiple of 16), encoded by VAE, optionally
corrupted with Gaussian noise at each sigma in `--degrade_sigmas`, then decoded by PiD
at `--scale * vae_native_resolution`.
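The size arithmetic described above can be sketched in a few lines. The helpers `center_crop_box` and `pid_output_size` are illustrative, and the sketch assumes that `vae_native_resolution` equals the cropped input size; the repo's own cropping code may round or align differently.

```python
def center_crop_box(width, height, multiple=16):
    """Return (left, top, right, bottom) of the largest centered crop
    whose sides are both multiples of `multiple`."""
    new_w = (width // multiple) * multiple
    new_h = (height // multiple) * multiple
    if new_w == 0 or new_h == 0:
        raise ValueError("image is smaller than one multiple")
    left = (width - new_w) // 2
    top = (height - new_h) // 2
    return left, top, left + new_w, top + new_h


def pid_output_size(width, height, scale=4, multiple=16):
    """Output (width, height) after cropping and decoding at `scale`."""
    left, top, right, bottom = center_crop_box(width, height, multiple)
    return (right - left) * scale, (bottom - top) * scale


print(center_crop_box(1000, 750))  # (4, 7, 996, 743)
print(pid_output_size(1000, 750))  # (3968, 2944)
```

The box uses the `(left, upper, right, lower)` order that `PIL.Image.crop` expects, so it can be applied directly to a loaded image.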
Single-GPU example (Flux):
```bash
PYTHONPATH=. python -m pid._src.inference.from_clean --backbone flux \
--manifest assets/clean_image_manifest.jsonl \
--degrade_sigmas 0.0 \
--output_dir ./results/official_demo_from_clean/flux \
--cfg_scale 1 --pid_inference_steps 4 --scale 4
```
You can pass a single image with `--input_path` and a prompt with `--prompt`
instead of `--manifest`, and a sigma sweep such as `--degrade_sigmas 0.0 0.2 0.4 0.8`
to decode noise-corrupted latents. Swap `--backbone` to use a different VAE
(`flux2` / `sd3` / `sdxl` / `qwenimage`); `sdxl` automatically uses its
variance-preserving noising form.
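To make the two noising styles concrete, here is a sketch of one common convention for each. Both formulas are assumptions chosen for illustration (linear interpolation as used in rectified-flow models, and a variance-preserving mix with `sigma` as the noise standard deviation); the exact parameterization in `decoder.py` is not confirmed here, and all function names are hypothetical.

```python
import math
import random


def flow_coeffs(sigma):
    """Linear interpolation: x = (1 - sigma) * x0 + sigma * eps."""
    return 1.0 - sigma, sigma


def vp_coeffs(sigma):
    """Variance-preserving: x = sqrt(1 - sigma^2) * x0 + sigma * eps."""
    return math.sqrt(1.0 - sigma * sigma), sigma


def degrade(latent, sigma, coeffs=flow_coeffs, seed=0):
    """Corrupt a flat list of latent values with Gaussian noise at `sigma`."""
    if not 0.0 <= sigma <= 1.0:
        raise ValueError("sigma must be in [0, 1]")
    rng = random.Random(seed)  # seeded so repeated runs give the same noise
    signal, noise = coeffs(sigma)
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in latent]


latent = [1.0, -2.0, 0.5]
for sigma in (0.0, 0.2, 0.4, 0.8):
    print(sigma, degrade(latent, sigma), degrade(latent, sigma, vp_coeffs))
```

In both conventions `sigma = 0.0` leaves the latent unchanged, which matches the clean-latent case in the example above.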
The `dinov2` / `siglip` `from_clean` flows take the same flags but with a different
`--scale` (8 for `siglip`); their encoders resize internally to their fixed native
interface (512 / 256) regardless of the input image size — see
[`docs/dinov2_siglip.md`](docs/dinov2_siglip.md).
## Repository layout
```
pid/_src/inference/
├── from_ldm.py # entrypoint: text/class → LDM → PiD decode (--backbone …)
├── from_clean.py # entrypoint: image → VAE encode → PiD decode (--backbone …)
├── cli_utils.py # argument parsers + backbone aliases for both entrypoints
├── decoder.py # shared PiD decode/save core (+ from_clean VAE round-trip & noising)
├── step_capture.py # diffusers callbacks: XtCaptureCallback / X0CaptureCallback
├── inference_utils.py # image/prompt/manifest IO, save_image, tags, AsyncUploader, S3 helpers
├── checkpoint_registry.py # backbone → PiD checkpoint mapping
├── pipeline_registry.py # diffusers backbone → HF pipeline mapping
├── rae_generation.py # DINOv2-RAE backend + run_rae_demo (--backbone dinov2)
├── scale_rae_generation.py# Scale-RAE backend + run_scale_rae_demo (--backbone siglip)
└── prompts/ # prompt files
```
## License
The PiD codebase is licensed under the [Apache License 2.0](LICENSE).
## Contributing
See [`CONTRIBUTING.md`](CONTRIBUTING.md) for development setup, code style,
and the DCO sign-off requirement.
## Acknowledgments
The authors would like to acknowledge [Yongsheng Yu](https://www.yongshengyu.com/) and [Wei Xiong](https://wxiong.me/) for open-sourcing [PixelDiT](https://pixeldit.github.io/)'s model and weights, and thank Product Managers [Aditya Mahajan](https://www.linkedin.com/in/aditya-mahajan1) and [Matt Cragun](https://www.linkedin.com/in/mcragun/) for their valuable support and guidance.
## Citation
```bibtex
@article{lu2026pid,
title={PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion},
author={Lu, Yifan and Wu, Qi and Wu, Jay Zhangjie and Wang, Zian and Ling, Huan and Fidler, Sanja and Ren, Xuanchi},
journal={arXiv preprint arXiv:2605.23902},
year={2026}
}
```