---
base_model:
  - Qwen/Qwen2.5-VL-3B-Instruct
datasets:
  - WaltonFuture/Multimodal-Cold-Start
  - WaltonFuture/Multimodal-RL-Data
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
---

# Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start

## Introduction

This model is presented in the paper [Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start](https://arxiv.org/abs/2505.22334). We present a comprehensive study on enhancing multimodal reasoning through a two-stage approach: (1) supervised fine-tuning (SFT) as a cold start with structured chain-of-thought reasoning patterns, followed by (2) reinforcement learning via Group Relative Policy Optimization (GRPO) to further refine these capabilities.

Our extensive experiments show that this combined approach consistently outperforms both SFT-only and RL-only methods across challenging multimodal reasoning benchmarks. The resulting models achieve state-of-the-art performance among open-source MLLMs at both 3B and 7B scales, with our 7B model showing substantial improvements over base models (e.g., 66.3% → 73.4% on MathVista, 62.9% → 70.4% on We-Math) and our 3B model achieving performance competitive with several 7B models.
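
The GRPO stage optimizes the policy against group-relative advantages computed from several responses sampled for the same prompt, so no learned value model is needed. The snippet below is a minimal, illustrative sketch of that advantage computation only; it is not the training code used for this model, and the binary correctness reward and group size are placeholder assumptions.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages for one group of sampled responses.

    rewards: shape (group_size,), one scalar reward per response sampled
    for the same prompt. Each advantage is the reward normalized by the
    group mean and standard deviation.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 responses sampled for one prompt, rewarded 1.0 if the final
# answer is correct and 0.0 otherwise (a typical rule-based reward).
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(group_relative_advantages(rewards))  # approximately [ 0.87, -0.87,  0.87, -0.87]
```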

## Model Comparison

## ✨ Key Highlights

- **Two-Stage Approach**: Combines Supervised Fine-Tuning (SFT) as a "cold start" for structured chain-of-thought reasoning with Reinforcement Learning (RL) via GRPO for further refinement.
- **Enhanced Multimodal Reasoning**: Consistently outperforms both SFT-only and RL-only methods on challenging multimodal reasoning benchmarks.
- **State-of-the-Art Performance**: Achieves SOTA performance among open-source MLLMs at both the 3B and 7B scales.
- **Significant Improvements**: The 7B model shows substantial gains over its base model (e.g., 73.4% on MathVista, 70.4% on We-Math), while the 3B model is competitive with several 7B models.
- **Practical Guidance**: Provides practical insights for developing advanced multimodal reasoning models.

## Sample Usage

You can load and use this model with the Hugging Face `transformers` library. Ensure you have `transformers`, `accelerate` (needed for `device_map="auto"`), and `Pillow` installed.

```bash
pip install transformers accelerate Pillow
```

Below is an example demonstrating how to perform multimodal inference:

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from PIL import Image
import torch

# Load the model and processor.
# Replace "WaltonFuture/Qwen2.5VL-3b-RLCS" with "WaltonFuture/Qwen2.5VL-7b-RLCS" for the 7B model.
model_id = "WaltonFuture/Qwen2.5VL-3b-RLCS"

processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Example image (replace with your image path or a PIL Image object).
# For example, download an image locally:
# import requests
# from io import BytesIO
# image_url = "https://www.ilusionviajera.com/wp-content/uploads/2021/04/paris-eiffel-tower-in-spring.jpg"
# response = requests.get(image_url)
# image = Image.open(BytesIO(response.content)).convert("RGB")
image_path = "path/to/your/image.jpg"  # Replace with your image path
image = Image.open(image_path).convert("RGB")

# Prepare the chat messages in the required multimodal format.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in detail and answer any questions about it. For example, what is the main subject?"},
        ],
    }
]

# Apply the model's chat template to format the input.
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Process the text and image together; keep all returned tensors
# (input_ids, attention_mask, pixel_values, image_grid_thw, ...).
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Generate the response.
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)

# Strip the prompt tokens and decode only the newly generated part.
generated = [out[len(inp):] for inp, out in zip(inputs.input_ids, outputs)]
response = processor.batch_decode(generated, skip_special_tokens=True)[0]

print(response)
```
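
If you prefer to follow the upstream Qwen2.5-VL examples, `qwen_vl_utils` can collect the vision inputs directly from the chat messages instead of passing the PIL image manually. This is a minimal sketch assuming `qwen-vl-utils` is installed (`pip install qwen-vl-utils`); it reuses `processor`, `model`, `messages`, and `text` from the example above.

```python
from qwen_vl_utils import process_vision_info

# Extract the image (and video) inputs referenced in the chat messages.
image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
generated = [out[len(inp):] for inp, out in zip(inputs.input_ids, outputs)]
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```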

## Data Access

Our two-stage datasets are now available on Hugging Face:

| Stage | Data |
|---|---|
| Cold Start | [Multimodal-Cold-Start](https://huggingface.co/datasets/WaltonFuture/Multimodal-Cold-Start) |
| RL | [Multimodal-RL-Data](https://huggingface.co/datasets/WaltonFuture/Multimodal-RL-Data) |
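
For example, both datasets can be pulled with the `datasets` library (the repository IDs below come from the model card metadata; check the dataset pages for the exact splits and columns):

```python
from datasets import load_dataset

# Cold-start SFT data and RL data.
cold_start = load_dataset("WaltonFuture/Multimodal-Cold-Start")
rl_data = load_dataset("WaltonFuture/Multimodal-RL-Data")

print(cold_start)  # inspect available splits and columns
print(rl_data)
```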

## Model Access

Our models are now available on Hugging Face:

| Backbone | Our model |
|---|---|
| Qwen2.5-VL-7b | Qwen2.5VL-7b-RL-with-Cold-Start |
| Qwen2.5-VL-3b | Qwen2.5VL-3b-RL-with-Cold-Start |

## Acknowledgment

Our models are built upon the amazing Qwen2.5-VL family. We thank EasyR1 and ms-swift for their training code.

## Citation

If our work has been helpful to you, please consider citing it:

```bibtex
@article{wei2025advancing,
  title={Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start},
  author={Wei, Lai and Li, Yuting and Zheng, Kaipeng and Wang, Chen and Wang, Yue and Kong, Linghe and Sun, Lichao and Huang, Weiran},
  journal={arXiv preprint arXiv:2505.22334},
  year={2025}
}
```