Abstract
PixelDiT is a single-stage, end-to-end diffusion model that operates directly in pixel space, overcoming the limitations of latent-space modeling by using a dual-level transformer architecture and achieving competitive performance in image and text-to-image generation.
Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline in which the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation and hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in pixel space. PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. Our analysis reveals that effective pixel-level token modeling is essential to the success of pixel diffusion. PixelDiT achieves 1.61 FID on ImageNet 256×256, surpassing existing pixel generative models by a large margin. We further extend PixelDiT to text-to-image generation and pretrain it at 1024×1024 resolution in pixel space. It achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models.
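To give a rough sense of why a patch level and a pixel level are kept separate, the sketch below counts tokens at each level and the number of token pairs that full self-attention would touch. It is illustrative arithmetic only, not the authors' implementation: the patch size of 16 and the helper names `token_counts` and `attention_pairs` are assumptions, and the abstract does not say how the pixel-level DiT actually attends.

```python
def token_counts(height: int, width: int, patch_size: int) -> dict:
    """Count tokens at the pixel level and the patch level for one image.

    patch_size is a hypothetical value chosen for illustration; the
    abstract does not state the patch size PixelDiT uses.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patch_tokens = (height // patch_size) * (width // patch_size)
    return {
        "pixel_tokens": height * width,              # one token per pixel
        "patch_tokens": patch_tokens,                # one token per patch
        "pixels_per_patch": patch_size * patch_size, # pixels grouped under each patch
    }


def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs in full self-attention over num_tokens."""
    return num_tokens ** 2


if __name__ == "__main__":
    # The two resolutions mentioned in the abstract.
    for side in (256, 1024):
        counts = token_counts(side, side, patch_size=16)  # assumed patch size
        print(
            f"{side}x{side}: "
            f"{counts['patch_tokens']} patch tokens "
            f"({attention_pairs(counts['patch_tokens'])} attention pairs), "
            f"{counts['pixel_tokens']} pixel tokens "
            f"({attention_pairs(counts['pixel_tokens'])} attention pairs)"
        )
```

Under the assumed patch size, a 256×256 image yields 256 patch tokens but 65,536 pixel tokens, so global attention over patches is far cheaper than global attention over every pixel. That gap is one plausible reading of why global semantics are handled at the patch level while texture refinement is left to a pixel-level model.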
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation (2025)
- DiP: Taming Diffusion Models in Pixel Space (2025)
- Diffusion Transformers with Representation Autoencoders (2025)
- ScaleDiff: Higher-Resolution Image Synthesis via Efficient and Model-Agnostic Diffusion (2025)
- Adapting Self-Supervised Representations as a Latent Space for Efficient Generation (2025)
- Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers (2025)
- Sprint: Sparse-Dense Residual Fusion for Efficient Diffusion Transformers (2025)