Abstract
MiMo-VL-7B-SFT and MiMo-VL-7B-RL provide state-of-the-art general visual understanding and multimodal reasoning through four-stage pre-training and Mixed On-policy Reinforcement Learning, outperforming models with up to 78B parameters.
We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models delivering state-of-the-art performance in both general visual understanding and multimodal reasoning. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, and scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters. For GUI grounding applications, it sets a new standard with 56.1 on OSWorld-G, even outperforming specialized models such as UI-TARS. Our training combines four-stage pre-training (2.4 trillion tokens) with Mixed On-policy Reinforcement Learning (MORL) integrating diverse reward signals. We identify the importance of incorporating high-quality reasoning data with long Chain-of-Thought into pre-training stages, and the benefits of mixed RL despite challenges in simultaneous multi-domain optimization. We also contribute a comprehensive evaluation suite covering 50+ tasks to promote reproducibility and advance the field. The model checkpoints and full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-VL.
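The core idea behind MORL is to scalarize heterogeneous reward signals (rule-based verifiable rewards for reasoning and grounding tasks, plus learned preference scores) into a single on-policy training reward. Below is a minimal, hypothetical sketch of that mixing step with GRPO-style group-normalized advantages; all names (`verifiable_reward`, `preference_reward`, `mix_rewards`, the 0.7/0.3 weights) are illustrative assumptions and are not taken from the MiMo-VL codebase.

```python
# Hypothetical sketch of mixing diverse reward signals for on-policy RL.
# Function names, weights, and the toy preference proxy are illustrative only.
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Optional

@dataclass
class Rollout:
    prompt: str
    response: str
    reference: Optional[str] = None  # ground truth, when the task is verifiable

def verifiable_reward(r: Rollout) -> float:
    """Rule-based reward for checkable tasks (math answers, GUI grounding boxes)."""
    if r.reference is None:
        return 0.0
    return 1.0 if r.response.strip() == r.reference.strip() else 0.0

def preference_reward(r: Rollout) -> float:
    """Stand-in for a learned human-preference reward model score in [0, 1]."""
    return min(len(r.response) / 100.0, 1.0)  # toy proxy, not a real reward model

def mix_rewards(r: Rollout, weights=(0.7, 0.3)) -> float:
    """Scalarize the diverse reward signals into one training reward."""
    w_verify, w_pref = weights
    return w_verify * verifiable_reward(r) + w_pref * preference_reward(r)

def group_advantages(rewards: list) -> list:
    """Group-normalized advantages over on-policy rollouts of the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    return [(x - mu) / sigma for x in rewards]

if __name__ == "__main__":
    group = [
        Rollout("2+2=?", "4", reference="4"),
        Rollout("2+2=?", "5", reference="4"),
        Rollout("2+2=?", "The answer is 4, because 2+2=4.", reference="4"),
    ]
    rewards = [mix_rewards(r) for r in group]
    print(list(zip(rewards, group_advantages(rewards))))
```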
Community
MiMo-VL technical report.
HF Models: https://huggingface.co/collections/XiaomiMiMo/mimo-vl-68382ccacc7c2875500cd212
GitHub: https://github.com/XiaomiMiMo/MiMo-VL
Evaluation Suite: https://github.com/XiaomiMiMo/lmms-eval
Librarian Bot (automated): the following similar papers were recommended by the Semantic Scholar API.
- Seed1.5-VL Technical Report (2025)
- Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models (2025)
- MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining (2025)
- Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency (2025)
- Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO (2025)
- Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning (2025)
- G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning (2025)