Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling
Abstract
Beam search in discrete visual autoregressive models enhances text-to-image generation more effectively than search in continuous diffusion models, highlighting architecture's importance over scale.
While inference-time scaling through search has revolutionized Large Language Models, translating these gains to image generation has proven difficult. Recent attempts to apply search strategies to continuous diffusion models show limited benefits, with simple random sampling often performing best. We demonstrate that the discrete, sequential nature of visual autoregressive models enables effective search for image generation. We show that beam search substantially improves text-to-image generation, enabling a 2B parameter autoregressive model to outperform a 12B parameter diffusion model across benchmarks. Systematic ablations show that this advantage comes from the discrete token space, which allows early pruning and computational reuse, and our verifier analysis highlights trade-offs between speed and reasoning capability. These findings suggest that model architecture, not just scale, is critical for inference-time optimization in visual generation.
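The core mechanism the abstract describes can be made concrete with a minimal beam search sketch over a discrete token space. This is an illustrative toy, not the paper's actual model: the vocabulary, sequence length, and `score_prefix` verifier below are hypothetical stand-ins, but the sketch shows why discreteness enables early pruning, since weak partial sequences are discarded before being generated to completion.

```python
def beam_search(score_prefix, vocab, seq_len, beam_width):
    """Keep the `beam_width` best partial token sequences at each step.

    `score_prefix` plays the role of a verifier: it scores a partial
    sequence, so weak candidates are pruned early instead of being
    decoded all the way to the end.
    """
    beams = [((), 0.0)]  # (token prefix, cumulative score)
    for _ in range(seq_len):
        candidates = []
        for prefix, score in beams:
            for tok in vocab:  # extend each surviving prefix by one token
                new_prefix = prefix + (tok,)
                candidates.append((new_prefix, score + score_prefix(new_prefix)))
        # Early pruning: only the top-k prefixes survive to the next step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

# Toy verifier: reward sequences whose tokens strictly increase.
def toy_score(prefix):
    if len(prefix) < 2:
        return 0.0
    return 1.0 if prefix[-1] > prefix[-2] else -1.0

best_seq, best_score = beam_search(toy_score, vocab=[0, 1, 2, 3],
                                   seq_len=4, beam_width=2)
print(best_seq, best_score)
```

Note the computational reuse the abstract mentions: each surviving prefix is extended in place, so the work spent scoring it is shared by all of its continuations, which has no direct analogue in a continuous diffusion trajectory.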
Community
This work shows that a 2B autoregressive model with beam search generates better compositional images than a 12B diffusion model, suggesting that architecture can matter more than scale for efficient inference-time search.
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- CARINOX: Inference-time Scaling with Category-Aware Reward-based Initial Noise Optimization and Exploration (2025)
- Go with Your Gut: Scaling Confidence for Autoregressive Image Generation (2025)
- Efficient Conditional Generation on Scale-based Visual Autoregressive Models (2025)
- Planner and Executor: Collaboration between Discrete Diffusion And Autoregressive Models in Reasoning (2025)
- Self Speculative Decoding for Diffusion Large Language Models (2025)
- Not All Tokens are Guided Equal: Improving Guidance in Visual Autoregressive Models (2025)
- EchoGen: Generating Visual Echoes in Any Scene via Feed-Forward Subject-Driven Auto-Regressive Model (2025)