Improve model card for PixNerd
#1
by
nielsr
HF Staff
- opened
README.md
CHANGED
@@ -1,5 +1,92 @@
|
|
1 |
---
|
2 |
license: apache-2.0
|
|
|
|
|
3 |
---
|
4 |
|
5 |
-
1 |
---
|
2 |
license: apache-2.0
|
3 |
+
pipeline_tag: text-to-image
|
4 |
+
library_name: transformers
|
5 |
---
|
6 |
|
7 |
+
# PixNerd: Pixel Neural Field Diffusion
|
8 |
+
|
9 |
+
<div style="text-align: center;">
|
10 |
+
<a href="https://huggingface.co/papers/2507.23268"><img src="https://img.shields.io/badge/arXiv-2507.23268-b31b1b.svg" alt="arXiv"></a>
|
11 |
+
<a href="https://huggingface.co/spaces/MCG-NJU/PixNerd"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Online_Demo-green" alt="Online Demo"></a>
|
12 |
+
</div>
|
13 |
+
|
14 |
+

|
15 |
+
|
16 |
+
## Introduction
|
17 |
+
The current success of diffusion transformers heavily depends on the compressed latent space shaped by the pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address these problems, we propose PixNerd: Pixel Neural Field Diffusion, a single-scale, single-stage, efficient, end-to-end solution for image generation.
|
18 |
+
|
19 |
+
PixNerd is a powerful and efficient **pixel-space** diffusion transformer that operates directly on pixels, without a VAE. It uses a neural field to model patch-wise decoding, which improves high-frequency modeling.
|
20 |
+
|
21 |
+
### Key Highlights
|
22 |
+
* **VAE-Free Pixel Space Generation**: Operates directly in pixel space, eliminating accumulated errors and decoding artifacts often introduced by VAEs.
|
23 |
+
* **High-Fidelity Image Synthesis**: Achieves competitive FID scores on ImageNet benchmarks:
|
24 |
+
* **2.15 FID** on ImageNet $256\times256$ with PixNerd-XL/16.
|
25 |
+
* **2.84 FID** on ImageNet $512\times512$ with PixNerd-XL/16.
|
26 |
+
* **Competitive Text-to-Image Performance**: Extends to text-to-image applications, achieving a competitive 0.73 overall score on the GenEval benchmark and 80.9 overall score on the DPG benchmark with PixNerd-XXL/16.
|
27 |
+
* **Efficient Neural Field Representation**: Models patch-wise decoding with a neural field, which improves high-frequency modeling while keeping single-step inference time close to the baseline's.
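
For reference, the FID (Fréchet Inception Distance) quoted in the highlights above is conventionally defined between Gaussian fits to Inception features of real and generated images. The symbols below ($\mu_r, \Sigma_r$ for real images, $\mu_g, \Sigma_g$ for generated images) are standard notation and are not defined elsewhere in this card. Lower is better.

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
$$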
|
28 |
+
|
29 |
+
## Visualizations
|
30 |
+

|
31 |
+

|
32 |
+
|
33 |
+
## Revision of the inference time statistics
|
34 |
+
We apologize for this mistake: the single-step inference time reported for SiT-L/2 and Baseline-L is missing a zero (0.097s vs 0.0097s). The single-step inference times of PixNerd and the baseline are close.
|
35 |
+

|
36 |
+
|
37 |
+
## Checkpoints
|
38 |
+
|
39 |
+
| Dataset | Model | Params | FID | HuggingFace |
|
40 |
+
|---------------|---------------|--------|-------|---------------------------------------|
|
41 |
+
| ImageNet256 | PixNerd-XL/16 | 700M | 2.15 | [🤗](https://huggingface.co/MCG-NJU/PixNerd-XL-P16-C2I) |
|
42 |
+
| ImageNet512 | PixNerd-XL/16 | 700M | 2.84 | [🤗](https://huggingface.co/MCG-NJU/PixNerd-XL-P16-C2I) |
|
43 |
+
|
44 |
+
| Dataset | Model | Params | GenEval | DPG | HuggingFace |
|
45 |
+
|---------------|---------------|--------|------|------|----------------------------------------------------------|
|
46 |
+
| Text-to-Image | PixNerd-XXL/16| 1.2B | 0.73 | 80.9 | [🤗](https://huggingface.co/MCG-NJU/PixNerd-XXL-P16-T2I) |
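
The checkpoints in the tables above could be fetched programmatically along these lines. This is a minimal sketch, not the authors' tooling: it assumes `huggingface_hub` is installed and that each repo stores its weights as a `.ckpt` file. The filename is not stated in this card, so the hypothetical helpers `pick_checkpoints` and `download_checkpoint` discover it instead of hard-coding it.

```python
from typing import Iterable, List


def pick_checkpoints(filenames: Iterable[str]) -> List[str]:
    """Return the files that look like checkpoints (*.ckpt), sorted by name."""
    return sorted(name for name in filenames if name.endswith(".ckpt"))


def download_checkpoint(repo_id: str) -> str:
    """Download the first *.ckpt file found in `repo_id`; return its local path.

    Assumption: the repo contains at least one .ckpt file. The exact filename
    is unknown here, so it is looked up from the repo's file listing.
    """
    # Imported lazily so pick_checkpoints() works without huggingface_hub.
    from huggingface_hub import hf_hub_download, list_repo_files

    candidates = pick_checkpoints(list_repo_files(repo_id))
    if not candidates:
        raise FileNotFoundError(f"no .ckpt file found in {repo_id}")
    return hf_hub_download(repo_id=repo_id, filename=candidates[0])
```

For example, `download_checkpoint("MCG-NJU/PixNerd-XL-P16-C2I")` would return a local path that can be passed as `--ckpt_path`. If a repo holds several checkpoints, inspect the output of `pick_checkpoints` and choose explicitly.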
|
47 |
+
|
48 |
+
## Online Demos
|
49 |
+

|
50 |
+
We provide online demos for PixNerd-XXL/16 (text-to-image) on HuggingFace Spaces.
|
51 |
+
|
52 |
+
HF Spaces: [https://huggingface.co/spaces/MCG-NJU/PixNerd](https://huggingface.co/spaces/MCG-NJU/PixNerd)
|
53 |
+
|
54 |
+
To host the Gradio demo locally, run the following command:
|
55 |
+
```bash
|
56 |
+
# for text-to-image applications
|
57 |
+
python app.py --config configs_t2i/inference_heavydecoder.yaml --ckpt_path=XXX.ckpt
|
58 |
+
```
|
59 |
+
|
60 |
+
## Usage
|
61 |
+
For C2I (class-to-image, ImageNet), we use the ADM evaluation suite to report FID.
|
62 |
+
|
63 |
+
First, install the necessary dependencies:
|
64 |
+
```bash
|
65 |
+
pip install -r requirements.txt
|
66 |
+
```
|
67 |
+
|
68 |
+
To run inference:
|
69 |
+
```bash
|
70 |
+
python main.py predict -c configs_c2i/pix256std1_repa_pixnerd_xl.yaml --ckpt_path=XXX.ckpt
|
71 |
+
# or specify the GPU(s) to use:
|
72 |
+
CUDA_VISIBLE_DEVICES=0,1 python main.py predict -c configs_c2i/pix256std1_repa_pixnerd_xl.yaml --ckpt_path=XXX.ckpt
|
73 |
+
```
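
When inference is launched from a script instead of the shell, the command above can be assembled as follows. This is a sketch under stated assumptions: `build_predict_command` is a hypothetical helper, not part of the repository, and it only mirrors the `main.py predict` invocation and the `CUDA_VISIBLE_DEVICES` variable shown in this section.

```python
import os
import sys
from typing import Dict, List, Optional, Sequence, Tuple


def build_predict_command(
    config: str, ckpt_path: str, gpus: Optional[Sequence[int]] = None
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) equivalent to `python main.py predict -c ... --ckpt_path=...`."""
    argv = [
        sys.executable,  # the current Python interpreter
        "main.py",
        "predict",
        "-c",
        config,
        f"--ckpt_path={ckpt_path}",
    ]
    env = dict(os.environ)  # copy, so the caller's environment is not mutated
    if gpus is not None:
        # Comma-separated device indices, e.g. "0,1"
        env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpus)
    return argv, env
```

The result can be passed to `subprocess.run(argv, env=env, check=True)` from the repository root. Leaving `gpus` as `None` keeps whatever device visibility the parent process already has.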
|
74 |
+
|
75 |
+
For training:
|
76 |
+
```bash
|
77 |
+
python main.py fit -c configs_c2i/pix256std1_repa_pixnerd_xl.yaml
|
78 |
+
```
|
79 |
+
For T2I (text-to-image), we use GenEval and DPG to collect metrics.
|
80 |
+
|
81 |
+
## Reference
|
82 |
+
```bibtex
|
83 |
+
@article{2507.23268,
|
84 |
+
Author = {Shuai Wang and Ziteng Gao and Chenhui Zhu and Weilin Huang and Limin Wang},
|
85 |
+
Title = {PixNerd: Pixel Neural Field Diffusion},
|
86 |
+
Year = {2025},
|
87 |
+
Eprint = {arXiv:2507.23268},
|
88 |
+
}
|
89 |
+
```
|
90 |
+
|
91 |
+
## Acknowledgement
|
92 |
+
The code is mainly built upon [FlowDCN](https://github.com/MCG-NJU/FlowDCN) and [DDT](https://github.com/MCG-NJU/DDT).
|