
PixNerd: Pixel Neural Field Diffusion

arXiv: https://arxiv.org/abs/2507.23268

Introduction

The current success of diffusion transformers heavily depends on the compressed latent space shaped by the pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address these problems, we propose PixNerd: Pixel Neural Field Diffusion, a single-scale, single-stage, efficient, end-to-end solution for image generation.

PixNerd is a powerful and efficient pixel-space diffusion transformer that operates directly on pixels, with no VAE. It models patch-wise decoding with a neural field, improving high-frequency detail.
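The decoding idea can be sketched as a coordinate-conditioned MLP: each transformer patch token conditions a tiny network that maps normalized pixel coordinates inside the patch to RGB values. The sketch below is a hedged illustration of this patch-wise neural field, not the authors' exact architecture; all module and variable names are ours.

import torch
import torch.nn as nn

class PatchNeuralField(nn.Module):
    """Hedged sketch of patch-wise neural-field decoding: each patch token
    conditions a small MLP mapping per-pixel coordinates to RGB. This
    illustrates the idea only; it is not PixNerd's exact decoder."""
    def __init__(self, dim=768, patch=16, hidden=64):
        super().__init__()
        self.patch = patch
        # Token -> per-patch conditioning vector for the coordinate MLP.
        self.to_cond = nn.Linear(dim, hidden)
        self.mlp = nn.Sequential(
            nn.Linear(2 + hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 3),  # RGB per pixel
        )

    def forward(self, tokens):                        # tokens: (B, N, dim)
        B, N, _ = tokens.shape
        p = self.patch
        # Normalized (x, y) coordinates inside a patch, shared by all patches.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, p), torch.linspace(-1, 1, p), indexing="ij")
        coords = torch.stack([xs, ys], dim=-1).reshape(1, 1, p * p, 2)
        coords = coords.expand(B, N, -1, -1).to(tokens.device)
        cond = self.to_cond(tokens).unsqueeze(2).expand(-1, -1, p * p, -1)
        return self.mlp(torch.cat([coords, cond], dim=-1))  # (B, N, p*p, 3)

# Example: a 256x256 image with 16x16 patches gives N = 256 tokens.
dec = PatchNeuralField()
out = dec(torch.randn(2, 256, 768))
print(out.shape)  # torch.Size([2, 256, 256, 3])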

Key Highlights

  • VAE-Free Pixel Space Generation: Operates directly in pixel space, eliminating accumulated errors and decoding artifacts often introduced by VAEs.
  • High-Fidelity Image Synthesis: Achieves competitive FID scores on ImageNet benchmarks:
    • 2.15 FID on ImageNet 256×256 with PixNerd-XL/16.
    • 2.84 FID on ImageNet 512×512 with PixNerd-XL/16.
  • Competitive Text-to-Image Performance: Extends to text-to-image applications, achieving a competitive 0.73 overall score on the GenEval benchmark and 80.9 overall score on the DPG benchmark with PixNerd-XXL/16.
  • Efficient Neural Field Decoding: Models patch-wise decoding with a lightweight neural field, improving high-frequency detail without a separate decoder network.

Visualizations

Correction of the inference-time statistics

We are deeply sorry for this mistake: the reported single-step inference time of SiT-L/2 and Baseline-L was missing a zero (0.097 s, not 0.0097 s). The single-step inference times of PixNerd and the baseline are close.

Checkpoints

Dataset         Model            Params   FID    HuggingFace
ImageNet256     PixNerd-XL/16    700M     2.15   🤗
ImageNet512     PixNerd-XL/16    700M     2.84   🤗

Dataset         Model            Params   GenEval   DPG    HuggingFace
Text-to-Image   PixNerd-XXL/16   1.2B     0.73      80.9   🤗
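The checkpoints can be fetched programmatically with the huggingface_hub library. The filename argument below is a placeholder (we have not verified the actual checkpoint file name), so list the repository files first:

# Fetch a checkpoint with huggingface_hub. The filename is a placeholder --
# list the repo files to discover the real checkpoint name before downloading.
from huggingface_hub import hf_hub_download, list_repo_files

repo_id = "MCG-NJU/PixNerd-XXL-P16-T2I"
print(list_repo_files(repo_id))                                  # real file names
path = hf_hub_download(repo_id=repo_id, filename="model.ckpt")   # placeholder filename
print("checkpoint downloaded to", path)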

Online Demos

We provide online demos for PixNerd-XXL/16 (text-to-image) on HuggingFace Spaces.

HF spaces: https://huggingface.co/spaces/MCG-NJU/PixNerd
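The Space can also be queried programmatically with the gradio_client package. The concrete endpoint name and parameters are assumptions, so inspect them with view_api() before calling predict():

# Hedged sketch: query the hosted Space with gradio_client. The endpoint
# name and argument list are assumptions -- print them first with view_api().
from gradio_client import Client

client = Client("MCG-NJU/PixNerd")
client.view_api()  # lists the Space's actual endpoints and their arguments
# result = client.predict("a photo of a corgi", api_name="/generate")  # hypothetical endpoint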

To host the Gradio demo locally, run the following command:

# for text-to-image applications
python app.py --config configs_t2i/inference_heavydecoder.yaml  --ckpt_path=XXX.ckpt

Usage

For class-to-image generation (C2I) on ImageNet, we use the ADM evaluation suite to report FID.
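For a quick local sanity check before running the full ADM suite, FID can also be approximated with the torchmetrics implementation. This is a substitute sketch, not the suite used for the reported numbers, so values will differ slightly:

# Approximate FID with torchmetrics (NOT the ADM suite used for the reported
# numbers; expect small numerical differences). Images must be uint8 NCHW.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)
real = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)  # stand-in for real images
fake = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)  # stand-in for generated samples
fid.update(real, real=True)
fid.update(fake, real=False)
print(f"FID: {fid.compute().item():.3f}")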

First, install the necessary dependencies:

pip install -r requirements.txt

To run inference:

python main.py predict -c configs_c2i/pix256std1_repa_pixnerd_xl.yaml --ckpt_path=XXX.ckpt
# or specify the GPU(s) to use:
CUDA_VISIBLE_DEVICES=0,1 python main.py predict -c configs_c2i/pix256std1_repa_pixnerd_xl.yaml --ckpt_path=XXX.ckpt

For training:

python main.py fit -c configs_c2i/pix256std1_repa_pixnerd_xl.yaml
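The fit/predict subcommands and the --ckpt_path flag above match the PyTorch Lightning CLI convention. The sketch below shows what such an entry point typically looks like; that main.py is actually implemented this way is an assumption, and the imported modules are placeholders:

# Hedged sketch of a LightningCLI-style entry point consistent with the
# fit/predict commands above. That main.py works this way is an assumption;
# the imported modules are placeholders, not the repo's real names.
from lightning.pytorch.cli import LightningCLI

from models import DiffusionModule      # placeholder: the repo's LightningModule
from data import ImageNetDataModule     # placeholder: the repo's LightningDataModule

if __name__ == "__main__":
    # LightningCLI supplies the fit/predict subcommands, YAML config parsing,
    # and the --ckpt_path argument used in the commands above.
    LightningCLI(DiffusionModule, ImageNetDataModule)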

For text-to-image generation (T2I), we use the GenEval and DPG benchmarks to collect metrics.

Reference

@article{wang2025pixnerd,
  author  = {Shuai Wang and Ziteng Gao and Chenhui Zhu and Weilin Huang and Limin Wang},
  title   = {PixNerd: Pixel Neural Field Diffusion},
  journal = {arXiv preprint arXiv:2507.23268},
  year    = {2025},
}

Acknowledgement

The code is mainly built upon FlowDCN and DDT.
