CoRe^2: Collect, Reflect and Refine to Generate Better and Faster
Abstract
Making text-to-image (T2I) generative models sample both quickly and well represents a promising research direction. Previous studies have typically focused either on enhancing the visual quality of synthesized images at the expense of sampling efficiency, or on dramatically accelerating sampling without improving the base model's generative capacity. Moreover, nearly all inference methods fail to deliver stable performance simultaneously on both diffusion models (DMs) and visual autoregressive models (ARMs). In this paper, we introduce a novel plug-and-play inference paradigm, CoRe^2, which comprises three subprocesses: Collect, Reflect, and Refine. CoRe^2 first collects classifier-free guidance (CFG) trajectories, and then uses the collected data to train a weak model that reflects the easy-to-learn content while halving the number of function evaluations during inference. Subsequently, CoRe^2 employs weak-to-strong guidance to refine the conditional output, thereby improving the model's capacity to generate high-frequency and realistic content that is difficult for the base model to capture. To the best of our knowledge, CoRe^2 is the first to demonstrate both efficiency and effectiveness across a wide range of DMs, including SDXL, SD3.5, and FLUX, as well as ARMs such as LlamaGen. It exhibits significant performance improvements on HPD v2, Pick-a-Pic, DrawBench, GenEval, and T2I-CompBench. Furthermore, CoRe^2 can be seamlessly integrated with the state-of-the-art Z-Sampling, outperforming it by 0.3 and 0.16 on PickScore and AES, respectively, while saving 5.64s of inference time with SD3.5. Code is released at https://github.com/xie-lab-ml/CoRe/tree/main.
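For intuition, the Refine step can be read (under our interpretation of the abstract; the paper's exact formulation may differ) as replacing the unconditional branch of CFG with a cheap weak model distilled from the collected CFG trajectories, so each step needs only one forward pass of the base model. The symbols $\epsilon_\phi$ (weak model) and $\gamma$ (weak-to-strong scale) are illustrative placeholders, not the paper's notation:

```latex
% Standard CFG: two evaluations of the base model \epsilon_\theta per step
\hat{\epsilon}_{\mathrm{CFG}}(x_t, c)
  = \epsilon_\theta(x_t, \varnothing)
  + w \left[ \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \right]

% Weak-to-strong guidance (illustrative): one base-model evaluation
% plus one evaluation of the cheap weak model \epsilon_\phi
\hat{\epsilon}_{\mathrm{W2S}}(x_t, c)
  = \epsilon_\phi(x_t, c)
  + \gamma \left[ \epsilon_\theta(x_t, c) - \epsilon_\phi(x_t, c) \right]
```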
Community
Are you still troubled by the poor performance of inference-enhanced algorithms on large-scale flow-based diffusion models, particularly on SD3.5? Are you struggling to scale such algorithms to visual autoregressive models? Are you tired of waiting on the high computational cost of inference-enhanced algorithms?
In this work, we propose CoRe^2, a novel plug-and-play inference paradigm that addresses these challenges through three key subprocesses: Collect, Reflect, and Refine.
- Collect: CoRe^2 begins by collecting classifier-free guidance (CFG) trajectories.
- Reflect: Using the collected data, it trains a weak model to reflect the easy-to-learn content, halving the number of function evaluations during inference.
- Refine: Finally, CoRe^2 utilizes weak-to-strong guidance to refine the conditional output, significantly enhancing the model's ability to generate high-frequency and realistic details that are often challenging for the base model to capture.
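Below is a minimal, hedged sketch of what the Refine step could look like at inference time. The names (`base_model`, `weak_model`, `w2s_scale`) and the exact guidance formula are assumptions for illustration, not the released implementation:

```python
import torch

@torch.no_grad()
def core2_refine_step(base_model, weak_model, x_t, t, cond, w2s_scale=2.0):
    """One denoising step with weak-to-strong guidance (illustrative sketch).

    `weak_model` stands for the lightweight network distilled from the
    collected CFG trajectories in the Reflect stage; `base_model` is the
    original T2I backbone. Both are assumed to return a noise/velocity
    prediction given (latent, timestep, condition).
    """
    eps_strong = base_model(x_t, t, cond)  # single conditional pass of the base model
    eps_weak = weak_model(x_t, t, cond)    # cheap pass capturing the easy-to-learn content
    # Extrapolate away from the weak prediction toward the strong one, which
    # emphasizes the high-frequency, hard-to-learn detail the weak model misses.
    return eps_weak + w2s_scale * (eps_strong - eps_weak)
```

Compared with vanilla CFG, the expensive backbone is queried once instead of twice per step, while the guidance direction points from "weak" to "strong" rather than from "unconditional" to "conditional".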
To the best of our knowledge, CoRe^2 is the first inference paradigm to demonstrate both efficiency and effectiveness across a variety of diffusion models (DMs), including SDXL, SD3.5, and FLUX, as well as autoregressive models (ARMs) like LlamaGen. It achieves significant performance gains on benchmarks such as HPD v2, Pick-a-Pic, DrawBench, GenEval, and T2I-CompBench.
Moreover, CoRe^2 can be seamlessly integrated with state-of-the-art techniques like Z-Sampling, outperforming it by 0.3 and 0.16 on the PickScore and AES metrics, respectively, while saving 5.64 seconds of inference time with SD3.5.
The official implementation of CoRe^2 is available at https://github.com/xie-lab-ml/CoRe/tree/main.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- GenDR: Lightning Generative Detail Restorator (2025)
- Adding Additional Control to One-Step Diffusion with Joint Distribution Matching (2025)
- LightGen: Efficient Image Generation through Knowledge Distillation and Direct Preference Optimization (2025)
- Boosting Diffusion-Based Text Image Super-Resolution Model Towards Generalized Real-World Scenarios (2025)
- MGHanD: Multi-modal Guidance for authentic Hand Diffusion (2025)
- CHATS: Combining Human-Aligned Optimization and Test-Time Sampling for Text-to-Image Generation (2025)
- Layton: Latent Consistency Tokenizer for 1024-pixel Image Reconstruction and Generation by 256 Tokens (2025)