Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training
Abstract
Pixel-space generative models are often more difficult to train and generally underperform their latent-space counterparts, leaving a persistent gap in both performance and efficiency. In this paper, we introduce a novel two-stage training framework that closes this gap for pixel-space diffusion and consistency models. In the first stage, we pre-train encoders to capture meaningful semantics from clean images while aligning them with points along the same deterministic sampling trajectory, which evolves points from the prior to the data distribution. In the second stage, we integrate the encoder with a randomly initialized decoder and fine-tune the complete model end-to-end for both diffusion and consistency models. Our training framework demonstrates strong empirical performance on the ImageNet dataset. Specifically, our diffusion model reaches an FID of 2.04 on ImageNet-256 and 2.35 on ImageNet-512 with 75 function evaluations (NFE), surpassing prior pixel-space methods by a large margin in both generation quality and efficiency while rivaling leading VAE-based models at comparable training cost. Furthermore, on ImageNet-256, our consistency model achieves an impressive FID of 8.82 in a single sampling step, significantly surpassing its latent-space counterpart. To the best of our knowledge, this marks the first successful training of a consistency model directly on high-resolution images without relying on pre-trained VAEs or diffusion models.
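To make the two-stage recipe above concrete, here is a minimal PyTorch sketch of one training step per stage. Everything beyond the abstract is an assumption for illustration: the straight-line flow-matching trajectory, the cosine-alignment pre-training loss, and the toy `Encoder`/`Decoder` modules (and names like `stage1_align_loss`) are hypothetical stand-ins, not the paper's actual architectures or objectives.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Toy stand-in for the pre-trained semantic encoder."""
    def __init__(self, ch=3, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.SiLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, dim),
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Toy decoder predicting a velocity field from (x_t, t, features)."""
    def __init__(self, ch=3, dim=256):
        super().__init__()
        self.cond = nn.Linear(dim + 1, 64)
        self.net = nn.Sequential(
            nn.Conv2d(ch + 64, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, ch, 3, padding=1),
        )

    def forward(self, x_t, t, feat):
        c = self.cond(torch.cat([feat, t[:, None]], dim=1))
        c = c[:, :, None, None].expand(-1, -1, *x_t.shape[2:])
        return self.net(torch.cat([x_t, c], dim=1))

def trajectory_point(x0, t):
    # Assumed straight-line (flow-matching) interpolation between data and
    # noise; its deterministic ODE evolves prior samples to the data.
    noise = torch.randn_like(x0)
    tb = t[:, None, None, None]
    return (1 - tb) * x0 + tb * noise, noise

def stage1_align_loss(encoder, x0):
    # Stage 1: pull features of the clean image and features of a point on
    # its sampling trajectory together (cosine alignment is an assumption).
    t = torch.rand(x0.shape[0], device=x0.device)
    x_t, _ = trajectory_point(x0, t)
    f_clean = F.normalize(encoder(x0), dim=-1)
    f_traj = F.normalize(encoder(x_t), dim=-1)
    return 1.0 - (f_clean * f_traj).sum(dim=-1).mean()

def stage2_diffusion_loss(encoder, decoder, x0):
    # Stage 2: encoder + freshly initialized decoder trained end-to-end with
    # a standard velocity-matching (flow-matching) diffusion objective.
    t = torch.rand(x0.shape[0], device=x0.device)
    x_t, noise = trajectory_point(x0, t)
    target = noise - x0                      # velocity of the straight line
    pred = decoder(x_t, t, encoder(x_t))
    return F.mse_loss(pred, target)

if __name__ == "__main__":
    enc, dec = Encoder(), Decoder()
    x0 = torch.randn(8, 3, 32, 32)           # toy batch standing in for images

    opt1 = torch.optim.Adam(enc.parameters(), lr=1e-4)
    opt1.zero_grad()
    stage1_align_loss(enc, x0).backward()
    opt1.step()

    opt2 = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=1e-4)
    opt2.zero_grad()
    stage2_diffusion_loss(enc, dec, x0).backward()
    opt2.step()
```

The key structural point the sketch captures is that stage 1 trains only the encoder on an alignment objective, while stage 2 unfreezes everything and optimizes encoder and decoder jointly under a generative loss; a consistency-model variant would swap `stage2_diffusion_loss` for a consistency objective.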
Community
Hi @jiachenlei, congratulations on the breakthrough of your EPG model in pixel-space diffusion. I am the author of PixNerd. Previously, Pixelflow and PixNerd pushed the pixel-space diffusion performance frontier forward, achieving FIDs of 1.98 and 1.93, respectively. Since these works appear to be concurrent, could you consider discussing and comparing your EPG model with them?
PixNerd: https://huggingface.co/papers/2507.23268
Pixelflow: https://huggingface.co/papers/2504.07963
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Diffusion Transformers with Representation Autoencoders (2025)
- Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models (2025)
- UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation (2025)
- CMT: Mid-Training for Efficient Learning of Consistency, Mean Flow, and Flow Map Models (2025)
- VUGEN: Visual Understanding priors for GENeration (2025)
- Image Tokenizer Needs Post-Training (2025)
- SSDD: Single-Step Diffusion Decoder for Efficient Image Tokenization (2025)