---
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
datasets:
- omlab/OVDEval
language:
- en
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---
An OVD-enhanced Qwen2.5-VL 3B model trained with VLM-R1 reinforcement learning.

Paper: [arXiv:2504.07615](https://arxiv.org/abs/2504.07615)

Project page: [https://github.com/om-ai-lab/VLM-R1](https://github.com/om-ai-lab/VLM-R1)

🎉 Our VLM-R1 Math model reaches the top of the OpenCompass Math Leaderboard (under 4B parameters), and our OVD model achieves state-of-the-art performance on OVDEval.

Since the introduction of [Deepseek-R1](https://github.com/deepseek-ai/DeepSeek-R1), numerous works have emerged focusing on reproducing and improving upon it. In this project, we propose VLM-R1, a stable and generalizable R1-style Large Vision-Language Model.

Specifically, for the task of Referring Expression Comprehension (REC), we trained [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL) with both the R1 and SFT approaches. The results reveal that, on the in-domain test data, the SFT model's performance changes little relative to the base model when the number of training steps is relatively small (100–600 steps), while the R1 model shows a steady improvement. More importantly, on the out-of-domain test data, the SFT model's performance deteriorates slightly as the number of steps increases, whereas the RL model generalizes its reasoning ability to the out-of-domain data.

\* *We found that our previous REC SFT experiments used a mismatched pixel configuration. We therefore re-ran the study with the correct configuration on more complex out-of-domain data. See our [findings](https://om-ai-lab.github.io/2025_03_24.html) for details.*
## 🚀 Features
This repository supports:
- **`Full Fine-tuning for GRPO`**: see [run_grpo_rec.sh](https://github.com/om-ai-lab/VLM-R1/blob/main/src/open-r1-multimodal/run_scripts/run_grpo_rec.sh)
- **`Freeze Vision Modules`**: set `freeze_vision_modules` to `true` in the script.
- **`LoRA Fine-tuning for GRPO`**: see [run_grpo_rec_lora.sh](https://github.com/om-ai-lab/VLM-R1/blob/main/src/open-r1-multimodal/run_scripts/run_grpo_rec_lora.sh)
- **`Multi-node Training`**: see [multinode_training_demo.sh](https://github.com/om-ai-lab/VLM-R1/blob/main/src/open-r1-multimodal/run_scripts/multinode_training_demo.sh)
- **`Multi-image Input Training`**: see [run_grpo_gui.sh](https://github.com/om-ai-lab/VLM-R1/blob/main/src/open-r1-multimodal/run_scripts/run_grpo_gui.sh)
- **`For your own data`**: see [here](https://github.com/om-ai-lab/VLM-R1/blob/main/README.md#for-your-own-data); a hypothetical example record is sketched just after this list.
- **`Support for various VLMs`**: see [How to add a new model](https://github.com/om-ai-lab/VLM-R1/blob/main/assets/add_new_model.md); we currently support QwenVL and InternVL.
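
For reference, the sketch below writes one hypothetical single-image training record, assuming the LLaVA-style `conversations` jsonl format: the human turn carries the question with an `<image>` placeholder, and the gpt turn carries the ground-truth answer used for the reward. The field values, image path, and answer format are illustrative assumptions only; treat the linked README section as the authoritative schema.

```python
import json

# Hypothetical single-image training record (illustrative values only).
record = {
    "id": 1,
    "image": "images/example_000001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>Please locate the red car and output its bounding box."},
        {"from": "gpt", "value": "[102, 45, 318, 260]"},
    ],
}

# Each record occupies exactly one line of the .jsonl training file.
with open("my_dataset.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```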
## 🗞️ Update
- **`2025-04-11`**: 🔥🔥🔥 We release the [technical report](https://arxiv.org/abs/2504.07615) of VLM-R1, summarizing our main results and insights.
- **`2025-04-03`**: We add the `odLength`, `weighted_sum`, and `cosine` rewards used in the OVD task. Please refer to our [blog post](https://om-ai-lab.github.io/2025_03_20.html) and [findings](https://om-ai-lab.github.io/2025_03_24.html) for details on how these rewards are used, and see [grpo_jsonl.py](https://github.com/om-ai-lab/VLM-R1/blob/main/src/open-r1-multimodal/src/open_r1/grpo_jsonl.py) for the code implementation.
- **`2025-03-24`**: 🔥 We release the [findings](https://om-ai-lab.github.io/2025_03_24.html) of VLM-R1-OVD.
- **`2025-03-23`**: 🔥 We release the VLM-R1-OVD [model weights](https://huggingface.co/omlab/VLM-R1-Qwen2.5VL-3B-OVD-0321) and [demo](https://huggingface.co/spaces/omlab/VLM-R1-OVD), which achieve state-of-the-art performance on OVDEval. You are welcome to try them out.
- **`2025-03-20`**: 🔥 We achieved SOTA results on [OVDEval](https://github.com/om-ai-lab/OVDEval) with our RL-based model, outperforming SFT baselines and specialized object detection models. Read our [blog post](https://om-ai-lab.github.io/2025_03_20.html) for details on how reinforcement learning enhances object detection performance.
- **`2025-03-17`**: Our VLM-R1 Math model reaches the top of the [Open-Compass Math Leaderboard](https://rank.opencompass.org.cn/leaderboard-multimodal-reasoning/?m=REALTIME) (under 4B parameters). We have released the [checkpoint](https://huggingface.co/omlab/VLM-R1-Qwen2.5VL-3B-Math-0305).
- **`2025-03-15`**: We support multi-image input data. Check the format of multi-image input [here](https://github.com/om-ai-lab/VLM-R1/blob/main/README.md#for-your-own-data). We also provide an example multi-image training script, [run_grpo_gui.sh](https://github.com/om-ai-lab/VLM-R1/blob/main/src/open-r1-multimodal/run_scripts/run_grpo_gui.sh); see [here](https://github.com/om-ai-lab/VLM-R1/blob/main/README.md#for-your-own-data) for details.
- **`2025-03-13`**: We support InternVL for GRPO. See [run_grpo_rec_internvl.sh](https://github.com/om-ai-lab/VLM-R1/blob/main/src/open-r1-multimodal/run_scripts/run_grpo_rec_internvl.sh) for details. The annotation json files used in InternVL are [here](https://huggingface.co/datasets/omlab/VLM-R1/resolve/main/rec_jsons_internvl.zip). If you want to add your new model, please refer to [How to add a new model](https://github.com/om-ai-lab/VLM-R1/blob/main/assets/add_new_model.md).
- **`2025-03-02`**: We support LoRA Fine-tuning for GRPO. See [run_grpo_rec_lora.sh](https://github.com/om-ai-lab/VLM-R1/blob/main/src/open-r1-multimodal/run_scripts/run_grpo_rec_lora.sh) for details.
- **`2025-02-27`**: We support setting the `number of iterations per batch` and the `epsilon value for clipping` from the original GRPO algorithm via the args `--num_iterations` and `--epsilon`; a brief, generic illustration of the clipped objective appears after this list.
- **`2025-02-25`**: We support multi-node training for GRPO. See [multinode_training_demo.sh](https://github.com/om-ai-lab/VLM-R1/blob/main/src/open-r1-multimodal/run_scripts/multinode_training_demo.sh) for details.
- **`2025-02-21`**: We release the [checkpoint](https://huggingface.co/omlab/Qwen2.5VL-3B-VLM-R1-REC-500steps) of the VLM-R1 REC model.
- **`2025-02-20`**: We release the script for [general data loading](https://github.com/om-ai-lab/VLM-R1/blob/main/README.md#for-your-own-data).
- **`2025-02-19`**: We incorporate an explanation of the [SFT](https://github.com/om-ai-lab/VLM-R1/tree/main#sft) method.
- **`2025-02-17`**: We release the VLM-R1 REC [Demo](https://huggingface.co/spaces/omlab/VLM-R1-Referral-Expression) on Hugging Face Spaces.
- **`2025-02-15`**: We release the VLM-R1 repository and [GRPO](https://github.com/om-ai-lab/VLM-R1/tree/main#grpo) training script.
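
For readers unfamiliar with the clipping mentioned in the `2025-02-27` entry above: `--epsilon` bounds how far a single policy update can move each sample's probability ratio, and `--num_iterations` sets how many such update passes are made over a batch. The snippet below is a generic, self-contained illustration of the standard clipped surrogate with made-up numbers, not the repository's implementation; the actual training code lives in the repository.

```python
import numpy as np

def clipped_surrogate(logprob_new, logprob_old, advantage, epsilon=0.2):
    """PPO/GRPO-style clipped objective for a single sample.

    The probability ratio is clipped to [1 - epsilon, 1 + epsilon],
    which caps the size of any single policy update.
    """
    ratio = np.exp(logprob_new - logprob_old)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return np.minimum(unclipped, clipped)

# Toy example: the ratio has drifted to ~1.49, so with epsilon=0.2 the
# positive-advantage contribution is capped at 1.2 instead of 1.49.
print(clipped_surrogate(logprob_new=-0.5, logprob_old=-0.9, advantage=1.0))
```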
## 🤖 Models
- **[`OVD`](https://huggingface.co/omlab/VLM-R1-Qwen2.5VL-3B-OVD-0321)**: Trained with VLM-R1, our Open-Vocabulary Detection (OVD) model achieves state-of-the-art performance on OVDEval.
- **[`Math`](https://huggingface.co/omlab/VLM-R1-Qwen2.5VL-3B-Math-0305)**: Trained with VLM-R1, our math model focuses on multimodal reasoning tasks and ranks first on the OpenCompass Multimodal Reasoning Leaderboard among models under 4B parameters.
- **[`REC`](https://huggingface.co/omlab/Qwen2.5VL-3B-VLM-R1-REC-500steps)**: Trained with VLM-R1, our Referring Expression Comprehension (REC) model demonstrates superior performance on out-of-domain data and a range of reasoning-grounding tasks.

| Version | Base VLM | Checkpoint | Task Type |
| -------------------------------- | ------------ | ---------------------------------------------------------------------------------------------------- | ------------------------- |
| VLM-R1-Qwen2.5VL-3B-OVD-0321 | Qwen2.5VL-3B | [omlab/VLM-R1-Qwen2.5VL-3B-OVD-0321](https://huggingface.co/omlab/VLM-R1-Qwen2.5VL-3B-OVD-0321) | Open-Vocabulary Detection |
| VLM-R1-Qwen2.5VL-3B-Math-0305 | Qwen2.5VL-3B | [omlab/VLM-R1-Qwen2.5VL-3B-Math-0305](https://huggingface.co/omlab/VLM-R1-Qwen2.5VL-3B-Math-0305) | Multi-Modal Math |
| VLM-R1-Qwen2.5VL-3B-REC-500steps | Qwen2.5VL-3B | [omlab/Qwen2.5VL-3B-VLM-R1-REC-500steps](https://huggingface.co/omlab/Qwen2.5VL-3B-VLM-R1-REC-500steps) | REC/Reasoning-Grounding |
## 🛠️ Setup
```bash
conda create -n vlm-r1 python=3.10
conda activate vlm-r1
bash setup.sh
```
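
Once the environment is set up, the OVD checkpoint can be loaded through the standard Qwen2.5-VL integration in 🤗 Transformers. The sketch below is a minimal example: the image path and detection prompt are placeholders, and the exact prompt format used during training may differ, so see the [demo](https://huggingface.co/spaces/omlab/VLM-R1-OVD) and the project page for reference usage.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "omlab/VLM-R1-Qwen2.5VL-3B-OVD-0321"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image and instruction; adapt to your own data.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "path/to/your_image.jpg"},
        {"type": "text", "text": "Please detect all traffic lights in the image and output their bounding boxes."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```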
## 💪🏻 Training
For full training instructions—including data preparation, hyperparameter setup, and how to reproduce our results—please refer to the Training Guide in our GitHub repository: [VLM-R1](https://github.com/om-ai-lab/VLM-R1)
## 🤝 Acknowledgements
We would like to express our sincere gratitude to [DeepSeek](https://github.com/deepseek-ai/DeepSeek-R1), [Open-R1](https://github.com/huggingface/open-r1), [QwenVL](https://github.com/QwenLM/Qwen2.5-VL), [Open-R1-Multimodal](https://github.com/EvolvingLMMs-Lab/open-r1-multimodal), [R1-V](https://github.com/Deep-Agent/R1-V), [RefCOCO](https://github.com/lichengunc/refer), [RefGTA](https://github.com/mikittt/easy-to-understand-REG/tree/master/pyutils/refer2), [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory), [OVDEval](https://github.com/om-ai-lab/OVDEval), [GUI-Testing-Arena](https://huggingface.co/datasets/songjah/GTArena-UI-Defects), and [LISA](https://github.com/dvlab-research/LISA) for providing open-source resources that contributed to the development of this project.
## ⭐️ Citation
If you find this project useful, please consider citing us.
```bib
@article{shen2025vlm,
  title={VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model},
  author={Shen, Haozhan and Liu, Peng and Li, Jingcheng and Fang, Chunxin and Ma, Yibo and Liao, Jiajia and Shen, Qiaoli and Zhang, Zilun and Zhao, Kangjia and Zhang, Qianqian and Xu, Ruochen and Zhao, Tiancheng},
  journal={arXiv preprint arXiv:2504.07615},
  year={2025}
}
```