---
license: apache-2.0
pipeline_tag: unconditional-image-generation
---

# PixNerd: Pixel Neural Field Diffusion

<div style="text-align: center;">
  <a href="https://huggingface.co/papers/2507.23268"><img src="https://img.shields.io/badge/Paper-2507.23268-b31b1b.svg" alt="Paper"></a>
  <a href="https://github.com/MCG-NJU/PixNerd"><img src="https://img.shields.io/badge/GitHub-Code-blue.svg?logo=github" alt="Code"></a>
  <a href="https://huggingface.co/spaces/MCG-NJU/PixNerd"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Online_Demo-green" alt="Demo"></a>
</div>

PixNerd is a pixel-space diffusion transformer for image generation, introduced in the paper [PixNerd: Pixel Neural Field Diffusion](https://huggingface.co/papers/2507.23268). Conventional diffusion models depend on a compressed latent space shaped by a pre-trained VAE; PixNerd instead models patch-wise decoding with a neural field. The result is a single-scale, single-stage, end-to-end model that operates directly in pixel space and avoids the accumulated errors and decoding artifacts of a separate VAE stage.

<p align="center">
  <img src="https://huggingface.co/MCG-NJU/PixNerd/resolve/main/figs/arch.png" alt="PixNerd Architecture Diagram" width="700">
</p>

### ✨ Key Highlights

*   **Efficient Pixel-Space Diffusion**: Directly models image generation in pixel space, eliminating the need for VAEs and their associated complexities or artifacts.
*   **Neural Field Decoding**: Employs neural fields for patch-wise decoding, improving the modeling of high-frequency details.
*   **Single-Stage & End-to-End**: Offers a simplified, efficient training and inference paradigm without complex cascade pipelines.
*   **High Performance**: Achieves competitive FID scores on ImageNet 256x256 (2.15 FID) and 512x512 (2.84 FID) for class-conditional (C2I) image generation.
*   **Text-to-Image Extension**: The framework is extensible to text-to-image applications, achieving strong results on benchmarks like GenEval (0.73 overall score) and DPG (80.9 overall score).
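
The neural-field decoding idea in the highlights can be illustrated with a toy NumPy sketch. It assumes a hypernetwork-style design: each patch token is linearly projected to the weights of a tiny two-layer MLP, which is then evaluated at sinusoidally encoded pixel coordinates inside the patch. The function names, layer sizes, and the random projection `w_hyper` are all hypothetical; this is not the PixNerd implementation, which lives in the GitHub repository.

```python
import numpy as np


def coord_features(patch_size, num_freqs=4):
    """Sinusoidal encoding of normalized (x, y) pixel centers inside one patch."""
    axis = (np.arange(patch_size) + 0.5) / patch_size        # pixel centers in (0, 1)
    ys, xs = np.meshgrid(axis, axis, indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel()], axis=-1)     # (P*P, 2)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi            # (F,)
    angles = coords[:, :, None] * freqs                      # (P*P, 2, F)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(patch_size * patch_size, -1)        # (P*P, 4F)


def decode_patches(tokens, w_hyper, patch_size=16, hidden=32, channels=3, num_freqs=4):
    """Decode each token into a (P, P, C) pixel patch with a per-patch MLP.

    tokens:  (N, D) patch tokens, e.g. the output of a transformer.
    w_hyper: (D, in_dim*hidden + hidden*channels) hypothetical projection
             that turns a token into the weights of its own small MLP.
    """
    n = tokens.shape[0]
    in_dim = 4 * num_freqs
    n_w1 = in_dim * hidden
    params = tokens @ w_hyper                                # (N, n_w1 + hidden*channels)
    w1 = params[:, :n_w1].reshape(n, in_dim, hidden)
    w2 = params[:, n_w1:].reshape(n, hidden, channels)
    feats = coord_features(patch_size, num_freqs)            # shared by all patches
    h = np.maximum(np.einsum("pi,nih->nph", feats, w1), 0.0) # ReLU hidden layer
    out = np.einsum("nph,nhc->npc", h, w2)                   # (N, P*P, C)
    return out.reshape(n, patch_size, patch_size, channels)


rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 64))                        # 4 toy patch tokens
w_hyper = rng.standard_normal((64, 16 * 32 + 32 * 3)) * 0.02
patches = decode_patches(tokens, w_hyper)
print(patches.shape)  # (4, 16, 16, 3)
```

In this sketch the patch size only changes the coordinate grid, so the same tokens can be decoded at a different `patch_size` without touching any weights.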

## Visualizations

Below are sample images generated by PixNerd, showcasing its capabilities:

<p align="center">
  <img src="https://huggingface.co/MCG-NJU/PixNerd/resolve/main/figs/pixelnerd_teaser.png" alt="PixNerd Teaser" width="700">
  <br/>
  <img src="https://huggingface.co/MCG-NJU/PixNerd/resolve/main/figs/pixnerd_multires.png" alt="PixNerd Multi-Resolution Examples" width="700">
</p>

## Checkpoints

The following checkpoints are available:

| Dataset       | Model         | Params | FID   | HuggingFace                           |
|---------------|---------------|--------|-------|---------------------------------------|
| ImageNet256   | PixNerd-XL/16 | 700M   | 2.15  | [🤗](https://huggingface.co/MCG-NJU/PixNerd-XL-P16-C2I) |
| ImageNet512   | PixNerd-XL/16 | 700M   | 2.84  | [🤗](https://huggingface.co/MCG-NJU/PixNerd-XL-P16-C2I) |

| Dataset       | Model          | Params | GenEval | DPG  | HuggingFace                                              |
|---------------|----------------|--------|---------|------|----------------------------------------------------------|
| Text-to-Image | PixNerd-XXL/16 | 1.2B   | 0.73    | 80.9 | [🤗](https://huggingface.co/MCG-NJU/PixNerd-XXL-P16-T2I) |
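
One way to fetch these checkpoints from a script is the `huggingface_hub` package. The sketch below is an assumption about workflow, not part of the official codebase: the helper name `download_checkpoint`, the task keys, and the `checkpoints` directory are hypothetical, and the file names inside each repository are not assumed, so the whole repository is downloaded and its files are listed.

```python
from pathlib import Path

# Repository IDs taken from the checkpoint tables above.
CHECKPOINT_REPOS = {
    "c2i": "MCG-NJU/PixNerd-XL-P16-C2I",
    "t2i": "MCG-NJU/PixNerd-XXL-P16-T2I",
}


def download_checkpoint(task, local_dir="checkpoints"):
    """Download every file of the chosen repo and return the local file paths."""
    # Imported lazily so the mapping above is usable without the package installed.
    from huggingface_hub import snapshot_download

    repo_id = CHECKPOINT_REPOS[task]
    target = Path(local_dir) / repo_id.split("/")[-1]
    snapshot_download(repo_id=repo_id, local_dir=target)
    # Skip the download-metadata folder that huggingface_hub may create.
    return sorted(
        p for p in target.rglob("*") if p.is_file() and ".cache" not in p.parts
    )
```

Calling `download_checkpoint("c2i")` requires network access and `pip install huggingface_hub`; pass one of the returned paths as `--ckpt_path` once you have identified the checkpoint file.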

## Online Demos

You can try out the PixNerd-XXL/16 (text-to-image) model on our Hugging Face Space demo: [https://huggingface.co/spaces/MCG-NJU/PixNerd](https://huggingface.co/spaces/MCG-NJU/PixNerd).

To host a local Gradio demo for text-to-image applications, run the following command after setting up the environment:

```bash
python app.py --config configs_t2i/inference_heavydecoder.yaml --ckpt_path=XXX.ckpt
```

## Usage

For class-to-image (C2I) generation on ImageNet, use the provided codebase. First, install the required dependencies:

```bash
# for installation
pip install -r requirements.txt
```

Then, run inference using the `main.py` script (replace `XXX.ckpt` with your checkpoint path):

```bash
# for inference
python main.py predict -c configs_c2i/pix256std1_repa_pixnerd_xl.yaml --ckpt_path=XXX.ckpt
# or specify the GPU(s) to use:
CUDA_VISIBLE_DEVICES=0,1 python main.py predict -c configs_c2i/pix256std1_repa_pixnerd_xl.yaml --ckpt_path=XXX.ckpt
```
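
If you prefer to launch inference from Python, for example when sweeping over several checkpoints, the command above can be assembled programmatically. This is a minimal sketch: `build_predict_command` is a hypothetical helper, and it uses only the `predict` subcommand, `-c`, and `--ckpt_path` flags shown above.

```python
import os
import shlex


def build_predict_command(config, ckpt_path, gpus=None):
    """Return (argv, env) for the `main.py predict` call shown above."""
    argv = ["python", "main.py", "predict", "-c", config, f"--ckpt_path={ckpt_path}"]
    env = dict(os.environ)
    if gpus is not None:
        # Restrict the visible GPUs, e.g. [0, 1] becomes "0,1".
        env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpus)
    return argv, env


argv, env = build_predict_command(
    "configs_c2i/pix256std1_repa_pixnerd_xl.yaml", "XXX.ckpt", gpus=[0, 1]
)
print(shlex.join(argv))
# To launch from the repository root:
#   import subprocess; subprocess.run(argv, env=env, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a checkpoint path contains spaces.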

For more details on training and evaluation for both C2I and T2I applications, please refer to the [official GitHub repository](https://github.com/MCG-NJU/PixNerd).

## Citation

If you find this work useful for your research, please cite our paper:

```bibtex
@article{2507.23268,
Author = {Shuai Wang and Ziteng Gao and Chenhui Zhu and Weilin Huang and Limin Wang},
Title = {PixNerd: Pixel Neural Field Diffusion},
Year = {2025},
Eprint = {arXiv:2507.23268},
}
```

## Acknowledgement

The code is mainly built upon [FlowDCN](https://github.com/MCG-NJU/FlowDCN) and [DDT](https://github.com/MCG-NJU/DDT).