nielsr (HF Staff) committed · verified
Commit 8fb7d68 · 1 Parent(s): 805872c

Improve model card for PixNerd


This PR significantly enhances the model card for PixNerd.

It adds:
- The `pipeline_tag: text-to-image` and `library_name: transformers` metadata fields, which improve discoverability and indicate how the model can be loaded.
- Direct links to the Hugging Face paper page, the project's Hugging Face Space, and the GitHub repository.
- A comprehensive model description, key features, visualizations, checkpoint information, and detailed usage examples (including installation, inference, and training commands) directly sourced from the project's GitHub README.
- The official BibTeX citation and acknowledgements.

This update makes the model card much richer and more user-friendly, providing all necessary information in one place.

Files changed (1)
README.md +88 -1
README.md CHANGED
@@ -1,5 +1,92 @@
  ---
  license: apache-2.0
+ pipeline_tag: text-to-image
+ library_name: transformers
  ---

- arxiv.org/abs/2507.23268
+ # PixNerd: Pixel Neural Field Diffusion
+
+ <div style="text-align: center;">
+ <a href="https://huggingface.co/papers/2507.23268"><img src="https://img.shields.io/badge/arXiv-2507.23268-b31b1b.svg" alt="arXiv"></a>
+ <a href="https://huggingface.co/spaces/MCG-NJU/PixNerd"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Online_Demo-green" alt="Online Demo"></a>
+ </div>
+
+ ![](https://huggingface.co/MCG-NJU/PixNerd-XL-P16-C2I/resolve/main/figs/arch.png)
+
+ ## Introduction
+ The current success of diffusion transformers heavily depends on the compressed latent space shaped by a pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address these problems, we propose PixNerd: Pixel Neural Field Diffusion, a single-scale, single-stage, efficient, end-to-end solution for image generation.
+
+ PixNerd is a powerful and efficient **pixel-space** diffusion transformer that operates directly on pixels, with no VAE. It employs a neural field for patch-wise decoding, improving high-frequency modeling.
+
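+ To make the patch-wise neural-field decoding concrete, below is a minimal, hypothetical PyTorch sketch: a small MLP maps normalized per-pixel coordinates, conditioned on a per-patch feature from the diffusion transformer, to RGB values. This illustrates the general idea only; it is not the actual PixNerd implementation, and all module names and sizes are assumptions.
+
+ ```python
+ # Conceptual sketch only -- NOT the actual PixNerd architecture.
+ # A neural field decodes each patch: per-pixel coordinates, conditioned
+ # on a per-patch feature vector, are mapped to RGB values by a small MLP.
+ import torch
+ import torch.nn as nn
+
+ class PatchNeuralField(nn.Module):
+     def __init__(self, feat_dim=768, hidden=256, patch=16):  # sizes are assumptions
+         super().__init__()
+         self.patch = patch
+         self.mlp = nn.Sequential(
+             nn.Linear(feat_dim + 2, hidden), nn.SiLU(),
+             nn.Linear(hidden, hidden), nn.SiLU(),
+             nn.Linear(hidden, 3),  # RGB per pixel
+         )
+
+     def forward(self, patch_feats):  # (B, N, feat_dim), N = number of patches
+         B, N, D = patch_feats.shape
+         p = self.patch
+         # Normalized pixel coordinates within a patch, shared by all patches.
+         ys, xs = torch.meshgrid(
+             torch.linspace(-1, 1, p), torch.linspace(-1, 1, p), indexing="ij"
+         )
+         coords = torch.stack([xs, ys], dim=-1).reshape(1, 1, p * p, 2)
+         coords = coords.expand(B, N, -1, -1).to(patch_feats.device)
+         feats = patch_feats.unsqueeze(2).expand(-1, -1, p * p, -1)  # (B, N, p*p, D)
+         rgb = self.mlp(torch.cat([feats, coords], dim=-1))          # (B, N, p*p, 3)
+         return rgb  # reshape/unpatchify to (B, 3, H, W) downstream
+
+ decoder = PatchNeuralField()
+ out = decoder(torch.randn(2, 256, 768))  # 2 images, a 16x16 grid of patches each
+ print(out.shape)  # torch.Size([2, 256, 256, 3])
+ ```
+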
+ ### Key Highlights
+ * **VAE-Free Pixel Space Generation**: Operates directly in pixel space, eliminating the accumulated errors and decoding artifacts often introduced by VAEs.
+ * **High-Fidelity Image Synthesis**: Achieves competitive FID scores on ImageNet benchmarks:
+   * **2.15 FID** on ImageNet $256\times256$ with PixNerd-XL/16.
+   * **2.84 FID** on ImageNet $512\times512$ with PixNerd-XL/16.
+ * **Competitive Text-to-Image Performance**: Extends to text-to-image, reaching a 0.73 overall score on the GenEval benchmark and an 80.9 overall score on the DPG benchmark with PixNerd-XXL/16.
+ * **Efficient Neural Field Representation**: A lightweight patch-wise neural field handles decoding, improving high-frequency detail while keeping the pipeline single-stage and end-to-end.
+
+ ## Visualizations
+ ![](https://huggingface.co/MCG-NJU/PixNerd-XL-P16-C2I/resolve/main/figs/pixelnerd_teaser.png)
+ ![](https://huggingface.co/MCG-NJU/PixNerd-XL-P16-C2I/resolve/main/figs/pixnerd_multires.png)
+
+ ## Revision of the Inference Time Statistics
+ We sincerely apologize for a mistake: the reported single-step inference time of SiT-L/2 and Baseline-L is missing a zero (0.097s vs. 0.0097s). The single-step inference times of PixNerd and the baseline are close.
+ ![image.png](https://cdn-uploads.huggingface.co/production/uploads/66615c855fd9d736e670e0a9/vEGp4Lthv9JDjDa8Gvyze.png)
+
+ ## Checkpoints
+
+ | Dataset | Model | Params | FID | HuggingFace |
+ |---------------|---------------|--------|-------|---------------------------------------|
+ | ImageNet256 | PixNerd-XL/16 | 700M | 2.15 | [🤗](https://huggingface.co/MCG-NJU/PixNerd-XL-P16-C2I) |
+ | ImageNet512 | PixNerd-XL/16 | 700M | 2.84 | [🤗](https://huggingface.co/MCG-NJU/PixNerd-XL-P16-C2I) |
+
+ | Dataset | Model | Params | GenEval | DPG | HuggingFace |
+ |---------------|---------------|--------|------|------|----------------------------------------------------------|
+ | Text-to-Image | PixNerd-XXL/16| 1.2B | 0.73 | 80.9 | [🤗](https://huggingface.co/MCG-NJU/PixNerd-XXL-P16-T2I) |
+
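+ The checkpoints above can also be fetched programmatically. A minimal sketch using `huggingface_hub` (the repo id comes from the table above; the layout of the checkpoint files inside the repo is not assumed here):
+
+ ```python
+ # Download a PixNerd checkpoint repository from the Hugging Face Hub.
+ # Requires: pip install huggingface_hub
+ from huggingface_hub import snapshot_download
+
+ # Repo id taken from the checkpoint table above (ImageNet 256x256 model).
+ local_dir = snapshot_download("MCG-NJU/PixNerd-XL-P16-C2I")
+ print("Checkpoint files downloaded to:", local_dir)
+ # Pass the downloaded .ckpt file to the commands below via --ckpt_path.
+ ```
+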
+ ## Online Demos
+ ![](https://huggingface.co/MCG-NJU/PixNerd-XL-P16-C2I/resolve/main/figs/demo.png)
+ We provide an online demo for PixNerd-XXL/16 (text-to-image) on Hugging Face Spaces.
+
+ HF Spaces: [https://huggingface.co/spaces/MCG-NJU/PixNerd](https://huggingface.co/spaces/MCG-NJU/PixNerd)
+
+ To host the local Gradio demo, run the following command:
+ ```bash
+ # for text-to-image applications
+ python app.py --config configs_t2i/inference_heavydecoder.yaml --ckpt_path=XXX.ckpt
+ ```
+
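+ The hosted Space can also be queried from code with the `gradio_client` package. Since the Space's endpoint names and argument lists are not documented here, this sketch only connects and inspects the API; verify the endpoints with `view_api()` before calling `predict()`:
+
+ ```python
+ # Query the PixNerd Space programmatically.
+ # Requires: pip install gradio_client
+ from gradio_client import Client
+
+ client = Client("MCG-NJU/PixNerd")
+ # Endpoint names and parameters are assumptions to verify:
+ # this prints the Space's callable API before any predict() call.
+ client.view_api()
+ ```
+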
+ ## Usage
+ For C2I (ImageNet), we use the ADM evaluation suite to report FID.
+
+ First, install the necessary dependencies:
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ To run inference:
+ ```bash
+ python main.py predict -c configs_c2i/pix256std1_repa_pixnerd_xl.yaml --ckpt_path=XXX.ckpt
+ # or specify the GPU(s) to use:
+ CUDA_VISIBLE_DEVICES=0,1 python main.py predict -c configs_c2i/pix256std1_repa_pixnerd_xl.yaml --ckpt_path=XXX.ckpt
+ ```
+
+ For training:
+ ```bash
+ python main.py fit -c configs_c2i/pix256std1_repa_pixnerd_xl.yaml
+ ```
+
+ For T2I, we use GenEval and DPG to collect metrics.
+
+ ## Reference
+ ```bibtex
+ @article{2507.23268,
+   Author = {Shuai Wang and Ziteng Gao and Chenhui Zhu and Weilin Huang and Limin Wang},
+   Title = {PixNerd: Pixel Neural Field Diffusion},
+   Year = {2025},
+   Eprint = {arXiv:2507.23268},
+ }
+ ```
+
+ ## Acknowledgement
+ The code is mainly built upon [FlowDCN](https://github.com/MCG-NJU/FlowDCN) and [DDT](https://github.com/MCG-NJU/DDT).