Can Understanding and Generation Truly Benefit Together -- or Just Coexist?
Abstract
UAE, a novel framework, uses reinforcement learning to unify the image-to-text and text-to-image processes, yielding mutual gains in understanding and generation fidelity.
In this paper, we introduce an insightful paradigm through the Auto-Encoder lens: understanding acts as the encoder (I2T) that compresses images into text, and generation acts as the decoder (T2I) that reconstructs images from that text. Using reconstruction fidelity as the unified training objective, we enforce a coherent bidirectional information flow between the understanding and generation processes, bringing mutual gains. To implement this, we propose UAE, a novel framework for unified multimodal learning. We begin by pre-training the decoder with large-scale long-context image captions to capture fine-grained semantics and complex spatial relationships. We then propose Unified-GRPO, a reinforcement learning (RL) procedure with three stages: (1) a cold-start phase that gently initializes both encoder and decoder with a semantic reconstruction loss; (2) Generation for Understanding, where the encoder is trained to produce informative captions that maximize the decoder's reconstruction quality, enhancing its visual understanding; and (3) Understanding for Generation, where the decoder is refined to reconstruct from these captions, forcing it to exploit every detail and improving its long-context instruction following and generation fidelity. For evaluation, we introduce Unified-Bench, the first benchmark tailored to assess the degree of unification of unified multimodal models (UMMs). A surprising "aha moment" arises within the multimodal learning domain: as RL progresses, the encoder autonomously produces more descriptive captions, while the decoder simultaneously demonstrates a profound ability to understand these intricate descriptions, resulting in reconstructions of striking fidelity.
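To make the auto-encoder framing concrete, here is a minimal, illustrative sketch (not the paper's released code) of how reconstruction fidelity could serve as the shared RL reward in a GRPO-style update: the I2T encoder samples a group of captions per image, the T2I decoder reconstructs an image from each caption, and reconstructions are scored by semantic similarity to the original. The names `captioner`, `generator`, and `clip_image_encoder` are hypothetical stand-ins; the paper does not specify these interfaces.

```python
# Minimal sketch (assumptions, not the authors' implementation):
# reconstruction fidelity as a GRPO-style reward for unified I2T/T2I training.
import torch
import torch.nn.functional as F

def reconstruction_rewards(images, captioner, generator, clip_image_encoder, group_size=4):
    """For each image, sample a group of captions, reconstruct an image from each caption,
    and score each reconstruction by cosine similarity to the original in a CLIP-style
    embedding space. Returns per-image (rewards, group-normalized advantages)."""
    results = []
    for image in images:
        group_rewards = []
        for _ in range(group_size):
            caption = captioner.sample(image)            # I2T encoder: compress image into text
            reconstruction = generator.sample(caption)   # T2I decoder: reconstruct image from text
            z_orig = F.normalize(clip_image_encoder(image), dim=-1)
            z_rec = F.normalize(clip_image_encoder(reconstruction), dim=-1)
            group_rewards.append((z_orig * z_rec).sum())  # cosine similarity as reward
        rewards = torch.stack(group_rewards)
        # GRPO-style advantage: center and scale rewards within the sampled group
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
        results.append((rewards, advantages))
    return results
```

Under this reading, Stage 2 ("Generation for Understanding") would use such advantages to update the captioner while the generator is held fixed, and Stage 3 ("Understanding for Generation") would use the same reward to refine the generator on the encoder's long, detailed captions.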
Community
🔥🔥🔥Understanding ↔ Generation can boost each other — not just coexist! Framed as an autoencoder (I2T=encoder, T2I=decoder) and trained with Unified-GRPO (RL).
🧠🧠🧠Result: encoder writes richer captions, decoder reconstructs with striking fidelity.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API.
- OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation (2025)
- Skywork UniPic 2.0: Building Kontext Model with Online RL for Unified Multimodal Model (2025)
- Reconstruction Alignment Improves Unified Multimodal Models (2025)
- UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying (2025)
- Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation (2025)
- UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing (2025)
- UniUGG: Unified 3D Understanding and Generation via Geometric-Semantic Encoding (2025)