HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
Abstract
The remarkable success of the autoregressive paradigm has driven significant advances in Multimodal Large Language Models (MLLMs), with powerful models such as Show-o, Transfusion, and Emu3 achieving notable progress in unified image understanding and generation. For the first time, we uncover a common phenomenon: the understanding capability of MLLMs is typically stronger than their generative capability, with a significant gap between the two. Building on this insight, we propose HermesFlow, a simple yet general framework designed to seamlessly bridge the gap between understanding and generation in MLLMs. Specifically, we take homologous data as input to curate homologous preference data for both understanding and generation. Through Pair-DPO and self-play iterative optimization, HermesFlow effectively aligns multimodal understanding and generation using this homologous preference data. Extensive experiments demonstrate the significant superiority of our approach over prior methods, particularly in narrowing the gap between multimodal understanding and generation. These findings highlight the potential of HermesFlow as a general alignment framework for next-generation multimodal foundation models. Code: https://github.com/Gen-Verse/HermesFlow
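The abstract describes Pair-DPO as jointly optimizing preference pairs for understanding and generation derived from the same (homologous) input. A minimal sketch of one plausible form of such an objective, assuming Pair-DPO is a weighted sum of two standard DPO losses (the function names, the `alpha` weighting, and the tuple layout are illustrative assumptions, not the paper's actual implementation):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence
    log-probabilities under the policy and a frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(margin): small when the policy prefers the chosen sample
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def pair_dpo_loss(und_pair, gen_pair, alpha=0.5, beta=0.1):
    """Hypothetical Pair-DPO objective: a weighted sum of the
    understanding-side and generation-side DPO losses, both computed
    on homologous preference pairs curated from the same input."""
    l_und = dpo_loss(*und_pair, beta=beta)  # e.g. caption preference pair
    l_gen = dpo_loss(*gen_pair, beta=beta)  # e.g. image preference pair
    return alpha * l_und + (1.0 - alpha) * l_gen
```

With `alpha=0.5` both capabilities are pulled on equally each step, which matches the stated goal of narrowing the gap between them; in the self-play loop, the model's own outputs would be re-scored to build fresh preference pairs each iteration.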
Community
The following papers were recommended by the Semantic Scholar API
- VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model (2025)
- Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization (2025)
- I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models (2025)
- InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling (2025)
- Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation (2025)
- Multimodal Preference Data Synthetic Alignment with Reward Model (2024)
- Dual Diffusion for Unified Image Generation and Understanding (2024)