	---
	license: apache-2.0
	pipeline_tag: text-to-image
	library_name: transformers
	---

	# PixNerd: Pixel Neural Field Diffusion

	<div style="text-align: center;">
	<a href="https://huggingface.co/papers/2507.23268"><img src="https://img.shields.io/badge/arXiv-2507.23268-b31b1b.svg" alt="arXiv"></a>
	<a href="https://huggingface.co/spaces/MCG-NJU/PixNerd"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Online_Demo-green" alt="Online Demo"></a>
	</div>

	![](https://huggingface.co/MCG-NJU/PixNerd-XL-P16-C2I/resolve/main/figs/arch.png)

	## Introduction
	The current success of diffusion transformers heavily depends on the compressed latent space shaped by the pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address these problems, we propose PixNerd: Pixel Neural Field Diffusion, a single-scale, single-stage, efficient, end-to-end solution for image generation.

	PixNerd is an efficient pixel-space diffusion transformer that operates directly on pixels, without a VAE. It uses a neural field to model patch-wise decoding, which improves high-frequency modeling.
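
To illustrate what "a neural field for patch-wise decoding" can mean, here is a minimal, dependency-free sketch of a coordinate-based MLP evaluated at every pixel of one patch. It is not the authors' implementation: the patch size of 16 is read from the "/16" in the model name, the weights are random stand-ins (in a real model they would presumably be derived from the transformer's output for that patch), and all function names are hypothetical.

```python
import math
import random

PATCH = 16  # assumed patch size, read from the "/16" in the model name


def make_field_weights(hidden=8, seed=0):
    """Random stand-in for the per-patch MLP weights (hypothetical helper)."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(hidden)]  # 4 encoded inputs
    w2 = [[rng.gauss(0, 1) for _ in range(hidden)] for _ in range(3)]  # 3 outputs (RGB)
    return w1, w2


def encode(x, y):
    """Tiny sinusoidal encoding of normalized coordinates in [0, 1]."""
    return [math.sin(math.pi * x), math.cos(math.pi * x),
            math.sin(math.pi * y), math.cos(math.pi * y)]


def decode_patch(weights, patch=PATCH):
    """Evaluate the field at every pixel coordinate of one patch."""
    w1, w2 = weights
    out = []
    for row in range(patch):
        line = []
        for col in range(patch):
            feats = encode(col / (patch - 1), row / (patch - 1))
            hidden = [math.tanh(sum(w * f for w, f in zip(ws, feats))) for ws in w1]
            line.append([sum(w * h for w, h in zip(ws, hidden)) for ws in w2])
        out.append(line)
    return out  # nested list of shape patch x patch x 3


patch_pixels = decode_patch(make_field_weights())
print(len(patch_pixels), len(patch_pixels[0]), len(patch_pixels[0][0]))
```

The point of the sketch is that each pixel value is a function of its coordinate within the patch, so fine detail inside a patch is produced by the field rather than by a single linear projection of the token.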

	### Key Highlights
	* VAE-Free Pixel Space Generation: Operates directly in pixel space, eliminating accumulated errors and decoding artifacts often introduced by VAEs.
	* High-Fidelity Image Synthesis: Achieves competitive FID scores on ImageNet benchmarks:
	  * 2.15 FID on ImageNet $256\times256$ with PixNerd-XL/16.
	  * 2.84 FID on ImageNet $512\times512$ with PixNerd-XL/16.
	* Competitive Text-to-Image Performance: Extends to text-to-image applications, achieving a competitive 0.73 overall score on the GenEval benchmark and 80.9 overall score on the DPG benchmark with PixNerd-XXL/16.
	* Efficient Neural Field Representation: Models patch-wise decoding with a neural field, which improves high-frequency modeling.
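
To make the resolutions in the highlights concrete, the following sketch counts how many patch tokens the transformer would process per image. It assumes that the "/16" in PixNerd-XL/16 denotes a 16×16 pixel patch, as in the usual DiT-style naming; the function `patch_tokens` is our own illustration, not part of the PixNerd codebase.

```python
def patch_tokens(image_size: int, patch_size: int = 16) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


for size in (256, 512):
    tokens = patch_tokens(size)
    values_per_patch = 16 * 16 * 3  # RGB values covered by one patch
    print(f"{size}x{size}: {tokens} tokens, {values_per_patch} pixel values per token")
```

Under this assumption a $256\times256$ image yields 256 tokens and a $512\times512$ image yields 1024, with each token covering 768 pixel values.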

	## Visualizations
	![](https://huggingface.co/MCG-NJU/PixNerd-XL-P16-C2I/resolve/main/figs/pixelnerd_teaser.png)
	![](https://huggingface.co/MCG-NJU/PixNerd-XL-P16-C2I/resolve/main/figs/pixnerd_multires.png)

	## Revision of the inference time statistics
	We apologize for this mistake: the reported single-step inference times of SiT-L/2 and Baseline-L are missing a zero (0.097s vs 0.0097s). The single-step inference times of PixNerd and the baseline are close.
	![image.png](https://cdn-uploads.huggingface.co/production/uploads/66615c855fd9d736e670e0a9/vEGp4Lthv9JDjDa8Gvyze.png)

	## Checkpoints

	| Dataset | Model | Params | FID | HuggingFace |
	|---|---|---|---|---|
	| ImageNet256 | PixNerd-XL/16 | 700M | 2.15 | [🤗](https://huggingface.co/MCG-NJU/PixNerd-XL-P16-C2I) |
	| ImageNet512 | PixNerd-XL/16 | 700M | 2.84 | [🤗](https://huggingface.co/MCG-NJU/PixNerd-XL-P16-C2I) |

	| Dataset | Model | Params | GenEval | DPG | HuggingFace |
	|---|---|---|---|---|---|
	| Text-to-Image | PixNerd-XXL/16 | 1.2B | 0.73 | 80.9 | [🤗](https://huggingface.co/MCG-NJU/PixNerd-XXL-P16-T2I) |
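
The checkpoints listed above are hosted as Hugging Face model repositories, so one way to fetch them is `snapshot_download` from the `huggingface_hub` package. This is a sketch rather than the authors' documented workflow: the `CHECKPOINT_REPOS` mapping and the `fetch_checkpoint` helper are hypothetical names, and because the files inside each repository are not listed here, the whole repository is downloaded.

```python
# Repository IDs copied from the checkpoint tables; the dict keys are our own labels.
CHECKPOINT_REPOS = {
    "PixNerd-XL/16": "MCG-NJU/PixNerd-XL-P16-C2I",
    "PixNerd-XXL/16": "MCG-NJU/PixNerd-XXL-P16-T2I",
}


def fetch_checkpoint(model: str, local_dir: str = "checkpoints") -> str:
    """Download every file of the model's repository and return the local path."""
    repo_id = CHECKPOINT_REPOS[model]  # raises KeyError for unknown models
    # Imported lazily so the mapping can be inspected without the package installed.
    from huggingface_hub import snapshot_download

    target = f"{local_dir}/{repo_id.split('/')[-1]}"
    return snapshot_download(repo_id=repo_id, local_dir=target)
```

Calling `fetch_checkpoint("PixNerd-XXL/16")` needs network access and `pip install huggingface_hub`; the checkpoint file found in the returned directory can then be passed as `--ckpt_path`.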

	## Online Demos
	![](https://huggingface.co/MCG-NJU/PixNerd-XL-P16-C2I/resolve/main/figs/demo.png)
	We provide online demos for PixNerd-XXL/16 (text-to-image) on HuggingFace Spaces.

	HF spaces: [https://huggingface.co/spaces/MCG-NJU/PixNerd](https://huggingface.co/spaces/MCG-NJU/PixNerd)

	To host the Gradio demo locally, run the following command:
	```bash
	# for text-to-image applications
	python app.py --config configs_t2i/inference_heavydecoder.yaml --ckpt_path=XXX.ckpt
	```

	## Usage
	For class-to-image (C2I) generation on ImageNet, we use the ADM evaluation suite to report FID.
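
For reference, FID compares Inception feature statistics of real and generated images. Writing $(\mu_r, \Sigma_r)$ for the mean and covariance of the reference features and $(\mu_g, \Sigma_g)$ for those of the generated samples (symbols introduced here for illustration), the standard definition is:

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
$$

Lower values indicate that the generated distribution is closer to the reference distribution.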

	First, install the necessary dependencies:
	```bash
	pip install -r requirements.txt
	```

	To run inference:
	```bash
	python main.py predict -c configs_c2i/pix256std1_repa_pixnerd_xl.yaml --ckpt_path=XXX.ckpt
	# or specify the GPU(s) to use:
	CUDA_VISIBLE_DEVICES=0,1 python main.py predict -c configs_c2i/pix256std1_repa_pixnerd_xl.yaml --ckpt_path=XXX.ckpt
	```
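
When inference is launched from a Python script rather than the shell, the same invocation can be assembled with `subprocess`. The sketch below reuses only the `predict`, `-c`, and `--ckpt_path` arguments shown above; the helper names `build_predict_command`, `gpu_env`, and `run_predict` are hypothetical, and the checkpoint path remains a placeholder.

```python
import os
import subprocess
import sys


def build_predict_command(config: str, ckpt_path: str) -> list[str]:
    """Argument list equivalent to `python main.py predict -c CONFIG --ckpt_path=CKPT`."""
    return [sys.executable, "main.py", "predict", "-c", config, f"--ckpt_path={ckpt_path}"]


def gpu_env(gpus: list[int]) -> dict[str, str]:
    """Copy of the current environment with CUDA_VISIBLE_DEVICES restricted to `gpus`."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpus)
    return env


def run_predict(config: str, ckpt_path: str, gpus: list[int]) -> None:
    """Run inference from the repository root; raises CalledProcessError on failure."""
    subprocess.run(build_predict_command(config, ckpt_path), env=gpu_env(gpus), check=True)
```

For example, `run_predict("configs_c2i/pix256std1_repa_pixnerd_xl.yaml", "XXX.ckpt", [0, 1])` mirrors the two-GPU shell command, with `XXX.ckpt` replaced by a real checkpoint path.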

	For training:
	```bash
	python main.py fit -c configs_c2i/pix256std1_repa_pixnerd_xl.yaml
	```
	For text-to-image (T2I), we use GenEval and DPG to collect metrics.

	## Reference
	```bibtex
	@article{2507.23268,
	Author = {Shuai Wang and Ziteng Gao and Chenhui Zhu and Weilin Huang and Limin Wang},
	Title = {PixNerd: Pixel Neural Field Diffusion},
	Year = {2025},
	Eprint = {arXiv:2507.23268},
	}
	```

	## Acknowledgement
	The code is mainly built upon [FlowDCN](https://github.com/MCG-NJU/FlowDCN) and [DDT](https://github.com/MCG-NJU/DDT).