---
license: apache-2.0
pipeline_tag: unconditional-image-generation
---

# PixNerd: Pixel Neural Field Diffusion
[Paper](https://huggingface.co/papers/2507.23268) | [Code](https://github.com/MCG-NJU/PixNerd) | [Demo](https://huggingface.co/spaces/MCG-NJU/PixNerd)
PixNerd is a novel pixel-space diffusion transformer for image generation, introduced in the paper [PixNerd: Pixel Neural Field Diffusion](https://huggingface.co/papers/2507.23268). Unlike conventional diffusion models that rely on a compressed latent space produced by a pre-trained VAE, PixNerd models patch-wise decoding with a neural field. This yields a single-scale, single-stage, efficient, end-to-end solution that operates directly in pixel space, avoiding accumulated errors and decoding artifacts.
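To make the patch-wise neural-field idea concrete, here is a minimal, illustrative PyTorch sketch: a small coordinate MLP, conditioned on a transformer patch token, decodes that patch's pixels. The layer sizes, conditioning scheme, and coordinate encoding below are assumptions for illustration, not the paper's exact design.

```python
# Minimal sketch of patch-wise neural-field decoding (illustrative only;
# layer sizes, conditioning, and coordinate encoding are assumptions,
# not the paper's exact architecture).
import torch
import torch.nn as nn

class PatchNeuralField(nn.Module):
    """Decode one patch token into a P x P pixel patch by evaluating a
    small coordinate MLP conditioned on that token."""
    def __init__(self, token_dim: int = 768, patch_size: int = 16, hidden: int = 128):
        super().__init__()
        self.patch_size = patch_size
        # Maps (x, y) coordinates concatenated with the patch token to RGB.
        self.mlp = nn.Sequential(
            nn.Linear(2 + token_dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, token: torch.Tensor) -> torch.Tensor:
        # token: (B, token_dim) -> pixels: (B, 3, P, P)
        B, P = token.shape[0], self.patch_size
        # Normalized pixel coordinates in [-1, 1] within the patch.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, P), torch.linspace(-1, 1, P), indexing="ij"
        )
        coords = torch.stack([xs, ys], dim=-1).reshape(P * P, 2)
        coords = coords.unsqueeze(0).expand(B, -1, -1).to(token.device)  # (B, P*P, 2)
        cond = token.unsqueeze(1).expand(-1, P * P, -1)                  # (B, P*P, D)
        rgb = self.mlp(torch.cat([coords, cond], dim=-1))                # (B, P*P, 3)
        return rgb.transpose(1, 2).reshape(B, 3, P, P)

# Example: decode 4 patch tokens into 16x16 RGB patches.
tokens = torch.randn(4, 768)
print(PatchNeuralField()(tokens).shape)  # torch.Size([4, 3, 16, 16])
```

Because each patch is decoded independently by the field, no global VAE decoder is involved, which is what removes the decoding artifacts mentioned above.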

PixNerd Architecture Diagram

### ✨ Key Highlights

* **Efficient Pixel-Space Diffusion**: Directly models image generation in pixel space, eliminating the need for VAEs and their associated complexities and artifacts.
* **Neural Field Decoding**: Employs neural fields for patch-wise decoding, improving the modeling of high-frequency details.
* **Single-Stage & End-to-End**: Offers a simplified, efficient training and inference paradigm without complex cascade pipelines.
* **High Performance**: Achieves competitive FID scores for class-conditional generation on ImageNet 256x256 (2.15 FID) and 512x512 (2.84 FID).
* **Text-to-Image Extension**: The framework extends to text-to-image applications, achieving strong results on benchmarks such as GenEval (0.73 overall score) and DPG (80.9 overall score).

## Visualizations

Below are sample images generated by PixNerd, showcasing its capabilities:

PixNerd Teaser
PixNerd Multi-Resolution Examples

## Checkpoints

The following checkpoints are available:

| Dataset     | Model         | Params | FID  | HuggingFace |
|-------------|---------------|--------|------|-------------|
| ImageNet256 | PixNerd-XL/16 | 700M   | 2.15 | [🤗](https://huggingface.co/MCG-NJU/PixNerd-XL-P16-C2I) |
| ImageNet512 | PixNerd-XL/16 | 700M   | 2.84 | [🤗](https://huggingface.co/MCG-NJU/PixNerd-XL-P16-C2I) |

| Dataset       | Model          | Params | GenEval | DPG  | HuggingFace |
|---------------|----------------|--------|---------|------|-------------|
| Text-to-Image | PixNerd-XXL/16 | 1.2B   | 0.73    | 80.9 | [🤗](https://huggingface.co/MCG-NJU/PixNerd-XXL-P16-T2I) |

## Online Demos

You can try the PixNerd-XXL/16 (text-to-image) model on our Hugging Face Space: [https://huggingface.co/spaces/MCG-NJU/PixNerd](https://huggingface.co/spaces/MCG-NJU/PixNerd).

To host a local Gradio demo for text-to-image applications, run the following command after setting up the environment (replace `XXX.ckpt` with your checkpoint path):

```bash
python app.py --config configs_t2i/inference_heavydecoder.yaml --ckpt_path=XXX.ckpt
```

## Usage

For class-conditional image generation (C2i on ImageNet), you can use the provided codebase. First, install the required dependencies:

```bash
# install dependencies
pip install -r requirements.txt
```

Then run inference with the `main.py` script (replace `XXX.ckpt` with your checkpoint path):

```bash
# run inference
python main.py predict -c configs_c2i/pix256std1_repa_pixnerd_xl.yaml --ckpt_path=XXX.ckpt
# or specify the GPU(s) to use:
CUDA_VISIBLE_DEVICES=0,1 python main.py predict -c configs_c2i/pix256std1_repa_pixnerd_xl.yaml --ckpt_path=XXX.ckpt
```

For more details on training and evaluation for both C2i and T2i applications, please refer to the [official GitHub repository](https://github.com/MCG-NJU/PixNerd).

## Citation

If you find this work useful for your research, please cite our paper:

```bibtex
@article{wang2025pixnerd,
  author = {Shuai Wang and Ziteng Gao and Chenhui Zhu and Weilin Huang and Limin Wang},
  title  = {PixNerd: Pixel Neural Field Diffusion},
  year   = {2025},
  eprint = {arXiv:2507.23268},
}
```

## Acknowledgement

The code is mainly built upon [DDT](https://github.com/MCG-NJU/DDT) and [FlowDCN](https://github.com/MCG-NJU/FlowDCN).
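The checkpoints in the table above can also be fetched programmatically with the official `huggingface_hub` client; a minimal sketch follows. The exact checkpoint filename inside each repo is not listed in this card, so inspect the downloaded directory to locate it.

```python
# Fetch every file in a checkpoint repo. snapshot_download is part of the
# official huggingface_hub client; the repo_id comes from the table above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="MCG-NJU/PixNerd-XL-P16-C2I")
print(f"Checkpoint files downloaded to: {local_dir}")
# Point --ckpt_path at the .ckpt file inside local_dir; its exact name is
# not specified in this card, so list the directory to find it.
```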