---
license: apache-2.0
pipeline_tag: text-to-image
library_name: transformers
---

# PixNerd: Pixel Neural Field Diffusion

<div style="text-align: center;">
<a href="https://huggingface.co/papers/2507.23268"><img src="https://img.shields.io/badge/arXiv-2507.23268-b31b1b.svg" alt="arXiv"></a>
<a href="https://huggingface.co/spaces/MCG-NJU/PixNerd"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Online_Demo-green" alt="Hugging Face Online Demo"></a>
</div>



## Introduction

The current success of diffusion transformers depends heavily on the compressed latent space shaped by a pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address these problems, we propose PixNerd: Pixel Neural Field Diffusion, a single-scale, single-stage, efficient, end-to-end solution for image generation.

PixNerd is a powerful and efficient **pixel-space** diffusion transformer that operates directly on pixels, without a VAE. It uses a neural field to model patch-wise decoding, which improves high-frequency modeling.

### Key Highlights

* **VAE-Free Pixel-Space Generation**: Operates directly in pixel space, eliminating the accumulated errors and decoding artifacts often introduced by VAEs.
* **High-Fidelity Image Synthesis**: Achieves competitive FID scores on ImageNet benchmarks:
  * **2.15 FID** on ImageNet $256\times256$ with PixNerd-XL/16.
  * **2.84 FID** on ImageNet $512\times512$ with PixNerd-XL/16.
* **Competitive Text-to-Image Performance**: Extends to text-to-image generation, achieving an overall score of 0.73 on the GenEval benchmark and 80.9 on the DPG benchmark with PixNerd-XXL/16.
* **Efficient Neural Field Representation**: Models patch-wise decoding with a neural field, which improves high-frequency modeling.

## Visualizations





## Revision of the inference time statistics

We apologize for an error in the previously reported statistics: the single-step inference time of SiT-L/2 and Baseline-L was missing a zero (0.097s vs 0.0097s). The single-step inference times of PixNerd and the baseline are close.



## Checkpoints

| Dataset     | Model         | Params | FID  | HuggingFace |
|-------------|---------------|--------|------|-------------|
| ImageNet256 | PixNerd-XL/16 | 700M   | 2.15 | [🤗](https://huggingface.co/MCG-NJU/PixNerd-XL-P16-C2I) |
| ImageNet512 | PixNerd-XL/16 | 700M   | 2.84 | [🤗](https://huggingface.co/MCG-NJU/PixNerd-XL-P16-C2I) |

| Dataset       | Model          | Params | GenEval | DPG  | HuggingFace |
|---------------|----------------|--------|---------|------|-------------|
| Text-to-Image | PixNerd-XXL/16 | 1.2B   | 0.73    | 80.9 | [🤗](https://huggingface.co/MCG-NJU/PixNerd-XXL-P16-T2I) |
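
The checkpoints listed above are hosted as Hugging Face model repositories. As a minimal sketch, assuming the `huggingface-cli` tool from `huggingface_hub` is installed and that the weights are stored as ordinary files in those repositories, they could be fetched as shown below. The helper names `pixnerd_repo` and `pixnerd_download` and the `checkpoints/` target directory are illustrative and not part of the PixNerd codebase.

```bash
# Map a task name to the repository listed in the checkpoint tables.
# (Hypothetical helper, not part of the PixNerd repository.)
pixnerd_repo() {
  case "$1" in
    c2i) echo "MCG-NJU/PixNerd-XL-P16-C2I" ;;   # ImageNet class-to-image
    t2i) echo "MCG-NJU/PixNerd-XXL-P16-T2I" ;;  # text-to-image
    *)   echo "unknown task: $1 (expected c2i or t2i)" >&2; return 1 ;;
  esac
}

# Download every file of the chosen repository into checkpoints/<task>.
# Assumes `huggingface-cli` (from huggingface_hub) is on PATH.
pixnerd_download() {
  repo="$(pixnerd_repo "$1")" || return 1
  huggingface-cli download "$repo" --local-dir "checkpoints/$1"
}
```

Calling `pixnerd_download t2i` would place the text-to-image files under `checkpoints/t2i`. The checkpoint file name inside each repository is not stated here, so inspect the downloaded directory before passing a path to `--ckpt_path`.
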

## Online Demos



We provide an online demo of PixNerd-XXL/16 (text-to-image) on Hugging Face Spaces.

HF Spaces: [https://huggingface.co/spaces/MCG-NJU/PixNerd](https://huggingface.co/spaces/MCG-NJU/PixNerd)

To host the Gradio demo locally, run the following command:

```bash
# for text-to-image applications
python app.py --config configs_t2i/inference_heavydecoder.yaml --ckpt_path=XXX.ckpt
```
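
Because `XXX.ckpt` is a placeholder, it can help to verify the checkpoint path before launching the demo. The following is a small sketch of such a guard; `require_ckpt` and the `CKPT` variable are hypothetical names introduced for illustration, not part of the PixNerd scripts.

```bash
# Succeed only if the given path is a readable, non-empty file.
# (Hypothetical helper, not part of the PixNerd repository.)
require_ckpt() {
  if [ ! -s "$1" ] || [ ! -r "$1" ]; then
    echo "checkpoint not found, unreadable, or empty: $1" >&2
    return 1
  fi
}

# Example (not run here); CKPT is a placeholder for your checkpoint path:
#   CKPT=XXX.ckpt
#   require_ckpt "$CKPT" && \
#     python app.py --config configs_t2i/inference_heavydecoder.yaml --ckpt_path="$CKPT"
```

With this guard, a mistyped path produces a short error message rather than a failure partway through model loading.
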

## Usage

For class-to-image (C2I) generation on ImageNet, we use the ADM evaluation suite to report FID.

First, install the necessary dependencies:

```bash
pip install -r requirements.txt
```

To run inference:

```bash
python main.py predict -c configs_c2i/pix256std1_repa_pixnerd_xl.yaml --ckpt_path=XXX.ckpt
# or specify the GPU(s) to use:
CUDA_VISIBLE_DEVICES=0,1 python main.py predict -c configs_c2i/pix256std1_repa_pixnerd_xl.yaml --ckpt_path=XXX.ckpt
```
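
When the GPU list comes from a script or a job scheduler, the `CUDA_VISIBLE_DEVICES` value can be built programmatically. The sketch below joins GPU indices with commas; `join_gpus` is a hypothetical helper name introduced here for illustration.

```bash
# Join GPU indices with commas, e.g. "0 1" -> "0,1", with no trailing comma.
# The function body runs in a subshell so the IFS change does not leak out.
join_gpus() (
  IFS=,
  echo "$*"
)

# Example (not run here):
#   CUDA_VISIBLE_DEVICES="$(join_gpus 0 1)" python main.py predict \
#     -c configs_c2i/pix256std1_repa_pixnerd_xl.yaml --ckpt_path=XXX.ckpt
```
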

For training:

```bash
python main.py fit -c configs_c2i/pix256std1_repa_pixnerd_xl.yaml
```

For text-to-image (T2I) generation, we use GenEval and DPG to collect metrics.

## Reference

```bibtex
@article{2507.23268,
  Author = {Shuai Wang and Ziteng Gao and Chenhui Zhu and Weilin Huang and Limin Wang},
  Title = {PixNerd: Pixel Neural Field Diffusion},
  Year = {2025},
  Eprint = {arXiv:2507.23268},
}
```

## Acknowledgement

The code is mainly built upon [FlowDCN](https://github.com/MCG-NJU/FlowDCN) and [DDT](https://github.com/MCG-NJU/DDT).