---
license: apache-2.0
pipeline_tag: text-to-image
library_name: transformers
---

# PixNerd: Pixel Neural Field Diffusion

<div style="text-align: center;">
  <a href="https://huggingface.co/papers/2507.23268"><img src="https://img.shields.io/badge/arXiv-2507.23268-b31b1b.svg" alt="arXiv"></a>
    <a href="https://huggingface.co/spaces/MCG-NJU/PixNerd"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Online_Demo-green" alt="arXiv"></a>  
</div>

![](https://huggingface.co/MCG-NJU/PixNerd-XL-P16-C2I/resolve/main/figs/arch.png)

## Introduction
The current success of diffusion transformers heavily depends on the compressed latent space shaped by the pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address these problems, we propose PixNerd: Pixel Neural Field Diffusion, a single-scale, single-stage, efficient, end-to-end solution for image generation.

PixNerd is a powerful and efficient **pixel-space** diffusion transformer that operates directly on pixels, without a VAE. It employs a neural field for patch-wise decoding, which improves high-frequency modeling.
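
To make the patch-wise neural-field decoding concrete, below is a minimal PyTorch sketch of the idea: an MLP, conditioned on each patch token, maps normalized in-patch pixel coordinates to RGB values. The class name, layer sizes, and conditioning scheme are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch of patch-wise neural-field decoding (assumed layout,
# not the official PixNerd code): one MLP, conditioned on a patch token,
# maps in-patch pixel coordinates to RGB values.
import torch
import torch.nn as nn

class PatchNeuralField(nn.Module):
    def __init__(self, token_dim: int = 1152, patch_size: int = 16, hidden: int = 256):
        super().__init__()
        self.patch_size = patch_size
        # Normalized 2-D pixel coordinates inside a patch, in [-1, 1].
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, patch_size),
            torch.linspace(-1, 1, patch_size),
            indexing="ij",
        )
        self.register_buffer("coords", torch.stack([xs, ys], dim=-1).reshape(-1, 2))
        self.coord_embed = nn.Linear(2, hidden)
        self.token_embed = nn.Linear(token_dim, hidden)
        self.mlp = nn.Sequential(
            nn.SiLU(), nn.Linear(hidden, hidden), nn.SiLU(), nn.Linear(hidden, 3)
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, token_dim) transformer outputs, one per image patch.
        h = self.coord_embed(self.coords)[None, None]    # (1, 1, P*P, hidden)
        h = h + self.token_embed(tokens)[:, :, None, :]  # (B, N, P*P, hidden)
        return self.mlp(h)                               # (B, N, P*P, 3) RGB per pixel

tokens = torch.randn(2, 256, 1152)       # e.g. a 16x16 grid of patch tokens
print(PatchNeuralField()(tokens).shape)  # torch.Size([2, 256, 256, 3])
```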

### Key Highlights
*   **VAE-Free Pixel Space Generation**: Operates directly in pixel space, eliminating accumulated errors and decoding artifacts often introduced by VAEs.
*   **High-Fidelity Image Synthesis**: Achieves competitive FID scores on ImageNet benchmarks:
    *   **2.15 FID** on ImageNet $256\times256$ with PixNerd-XL/16.
    *   **2.84 FID** on ImageNet $512\times512$ with PixNerd-XL/16.
*   **Competitive Text-to-Image Performance**: Extends to text-to-image applications, achieving a 0.73 overall score on the GenEval benchmark and an 80.9 overall score on the DPG benchmark with PixNerd-XXL/16.
*   **Efficient Neural Field Representation**: Leverages efficient neural field representations for optimized performance.

## Visualizations
![](https://huggingface.co/MCG-NJU/PixNerd-XL-P16-C2I/resolve/main/figs/pixelnerd_teaser.png)
![](https://huggingface.co/MCG-NJU/PixNerd-XL-P16-C2I/resolve/main/figs/pixnerd_multires.png)

## Revision of the Inference Time Statistics
We sincerely apologize for this mistake: the single-step inference times reported for SiT-L/2 and Baseline-L were missing a zero (0.097s vs. 0.0097s). The single-step inference times of PixNerd and the baseline are close.
![image.png](https://cdn-uploads.huggingface.co/production/uploads/66615c855fd9d736e670e0a9/vEGp4Lthv9JDjDa8Gvyze.png)

## Checkpoints

| Dataset       | Model         | Params | FID   | HuggingFace                           |
|---------------|---------------|--------|-------|---------------------------------------|
| ImageNet256   | PixNerd-XL/16 | 700M   | 2.15  | [🤗](https://huggingface.co/MCG-NJU/PixNerd-XL-P16-C2I) |
| ImageNet512   | PixNerd-XL/16 | 700M   | 2.84  | [🤗](https://huggingface.co/MCG-NJU/PixNerd-XL-P16-C2I) |

| Dataset       | Model         | Params | GenEval | DPG  | HuggingFace                                              |
|---------------|---------------|--------|------|------|----------------------------------------------------------|
| Text-to-Image | PixNerd-XXL/16| 1.2B | 0.73 | 80.9 | [🤗](https://huggingface.co/MCG-NJU/PixNerd-XXL-P16-T2I) |
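
The checkpoints above can also be fetched programmatically. Below is a minimal sketch using `huggingface_hub`; the exact `.ckpt` filename inside each repo is not specified in this card, so the sketch locates it after download rather than guessing.

```python
# Minimal sketch: fetch a PixNerd checkpoint repo from the Hugging Face Hub.
# The exact .ckpt filename is not pinned down here, so we search for it
# after downloading the full repo snapshot.
from pathlib import Path

from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="MCG-NJU/PixNerd-XL-P16-C2I")
ckpts = sorted(Path(local_dir).rglob("*.ckpt"))
print(ckpts)  # pass one of these paths as --ckpt_path to main.py
```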

## Online Demos
![](https://huggingface.co/MCG-NJU/PixNerd-XL-P16-C2I/resolve/main/figs/demo.png)
We provide online demos for PixNerd-XXL/16 (text-to-image) on HuggingFace Spaces.

HF Spaces: [https://huggingface.co/spaces/MCG-NJU/PixNerd](https://huggingface.co/spaces/MCG-NJU/PixNerd)
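
The Space can also be queried programmatically with `gradio_client`. Since this card does not document the Space's endpoint signature, the sketch below inspects the API rather than assuming argument names.

```python
# Minimal sketch: connect to the hosted demo and list its endpoints.
from gradio_client import Client

client = Client("MCG-NJU/PixNerd")
print(client.view_api())  # shows endpoint names and parameters before calling predict()
```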

To host the Gradio demo locally, run the following command:
```bash
# for text-to-image applications
python app.py --config configs_t2i/inference_heavydecoder.yaml  --ckpt_path=XXX.ckpt
```

## Usage
For C2I (ImageNet), we use the ADM evaluation suite to report FID.

First, install the necessary dependencies:
```bash
pip install -r requirements.txt
```

To run inference:
```bash
python main.py predict -c configs_c2i/pix256std1_repa_pixnerd_xl.yaml --ckpt_path=XXX.ckpt
# or specify the GPU(s) to use:
CUDA_VISIBLE_DEVICES=0,1 python main.py predict -c configs_c2i/pix256std1_repa_pixnerd_xl.yaml --ckpt_path=XXX.ckpt
```

For training:
```bash
python main.py fit -c configs_c2i/pix256std1_repa_pixnerd_xl.yaml
```
For T2I, we use GenEval and DPG to collect metrics.

## Reference
```bibtex
@article{wang2025pixnerd,
  author = {Shuai Wang and Ziteng Gao and Chenhui Zhu and Weilin Huang and Limin Wang},
  title  = {PixNerd: Pixel Neural Field Diffusion},
  year   = {2025},
  eprint = {arXiv:2507.23268},
}
```

## Acknowledgement
The code is mainly built upon [FlowDCN](https://github.com/MCG-NJU/FlowDCN) and [DDT](https://github.com/MCG-NJU/DDT).