---
license: mit
language:
- en
pipeline_tag: image-text-to-text
tags:
- VLM
- multimodal
library_name: transformers
base_model:
- Qwen/Qwen2.5-VL-32B-Instruct
datasets:
- One-RL-to-See-Them-All/Orsta-Data-47k
---

# One RL to See Them All

* 🐙 **GitHub Repo:** [MiniMax-AI/One-RL-to-See-Them-All](https://github.com/MiniMax-AI/One-RL-to-See-Them-All)
* 📜 **Paper (arXiv):** [V-Triune: One RL to See Them All (arXiv:2505.18129)](https://arxiv.org/abs/2505.18129)
* 💾 **Dataset:** [Orsta-Data-47k on Hugging Face](https://huggingface.co/datasets/One-RL-to-See-Them-All/Orsta-Data-47k)

## Model Overview

**Orsta-32B-0321** is a vision-language model (VLM) designed to achieve strong performance across a wide spectrum of both visual reasoning and visual perception tasks. It is the result of post-training with [**V-Triune**](https://github.com/MiniMax-AI/One-RL-to-See-Them-All), our unified reinforcement learning (RL) system, which jointly optimizes VLMs on diverse multimodal tasks within a single, cohesive training pipeline. Orsta-32B-0321 was trained with V-Triune on a carefully curated set of eight challenging visual tasks, fostering robust generalization and enhanced capabilities.

## Training with V-Triune

Orsta-32B-0321's capabilities stem from its training with the V-Triune system. Key aspects of its training include:

* **Unified RL Framework (V-Triune):** V-Triune is a Visual Triple-Unified Reinforcement Learning system built on three complementary components:
  * *Sample-Level Data Formatting* (to unify diverse task inputs)
  * *Verifier-Level Reward Computation* (to deliver custom rewards via specialized verifiers)
  * *Source-Level Metric Monitoring* (to diagnose problems at the data-source level)

  It also incorporates an innovative *Dynamic IoU reward* mechanism, crucial for optimizing visual perception tasks (a toy sketch of this idea appears at the end of this card). You can find more details in our paper: [V-Triune](https://arxiv.org/abs/2505.18129).
* **Diverse Joint Task Optimization:** Orsta-32B-0321 was jointly optimized on the following eight visual tasks:
  * *Visual Reasoning Tasks:* Mathematics, Science Question Answering, Chart Understanding, and Puzzle Solving.
  * *Visual Perception Tasks:* Object Detection, Visual Grounding, Optical Character Recognition (OCR), and Object Counting.

This comprehensive training allows Orsta-32B-0321 to develop a deeper understanding of visual content and its relation to textual prompts, excelling at tasks that require intricate reasoning and precise perception.

## Performance

Per-category results on MEGA-Bench Core; the final row reports Orsta's gain (Δ) over its Qwen2.5-VL-32B-0321 backbone.

| Model | Knowledge | Mathematics | Perception | Coding | Info. Ex. | Planning | Science | Metrics | MEGA-Bench Core |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Qwen2.5-VL-32B-0321 | 8.48 | 12.62 | 11.99 | 13.59 | 15.44 | 8.61 | 16.78 | 14.91 | 11.87 |
| MM-Eureka-32B 💡 | 12.20 | 20.19 | 21.88 | 15.86 | 21.23 | 15.47 | 19.95 | 22.77 | 18.57 |
| VL-Rethinker-32B 💡 | 12.16 | 28.09 | 22.99 | 11.89 | 21.50 | 15.09 | 28.10 | 15.73 | 19.41 |
| **Orsta-32B-0321 (Ours) 💡** | **21.33** | **28.55** | **32.23** | **19.44** | **26.38** | **17.78** | **33.20** | **24.18** | **25.94** |
| Δ (Ours − Backbone) | +12.9 | +15.9 | +20.2 | +5.9 | +10.9 | +9.2 | +16.4 | +9.3 | +14.1 |

## How to Use

**Orsta-32B-0321** was developed by post-training the [**Qwen2.5-VL-32B-Instruct (0321 checkpoint)**](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct/tree/98948557b47f3244ac2764806ddd334ce3c684f9) model with our V-Triune reinforcement learning system. The 0321 checkpoint is a publicly available baseline with reliable core reasoning abilities, alongside recognized limitations in perception and output formatting (addressed in subsequent Qwen releases). Applying V-Triune to this baseline demonstrates its post-training capability to unlock a model's inherent potential and significantly elevate performance by refining and amplifying existing strengths.

Consequently, the core usage of **Orsta-32B-0321**, including input formatting and model interaction, follows the established patterns of the Qwen2.5-VL series; users familiar with Qwen2.5-VL models should find the interface intuitive. For comprehensive details on general capabilities, multi-turn dialogue format, and image input specifics, refer to the official [Qwen2.5-VL series documentation](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct) (consult the information relevant to the 32B Instruct version).
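Since the model keeps the Qwen2.5-VL interface, inference can be run with the standard `transformers` pattern for that series. Below is a minimal sketch; the repository id is assumed from this card's organization, so verify it against the model page before use.

```python
# Minimal inference sketch following the standard Qwen2.5-VL usage pattern.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "One-RL-to-See-Them-All/Orsta-32B-0321"  # assumed repo id; check the model page

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Build the chat-formatted prompt and collect the vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate, then strip the prompt tokens before decoding.
generated = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

`torch_dtype="auto"` and `device_map="auto"` let `transformers` pick an appropriate precision and shard the 32B weights across available GPUs.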
## Citation 🏆

If you use Orsta-32B-0321 or the V-Triune system in your research, please cite our work:

```bibtex
@article{ma2025one,
  title={One RL to See Them All: Visual Triple Unified Reinforcement Learning},
  author={Ma, Yan and Du, Linge and Shen, Xuyang and Chen, Shaoxiang and Li, Pengfei and Ren, Qibing and Ma, Lizhuang and Dai, Yuchao and Liu, Pengfei and Yan, Junjie},
  journal={arXiv preprint arXiv:2505.18129},
  year={2025}
}
```
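## Appendix: Dynamic IoU Reward (Illustrative Sketch)

For readers curious about the *Dynamic IoU reward* mentioned above, the toy sketch below illustrates the general idea of a perception reward whose IoU threshold tightens as training progresses. The specific threshold schedule (0.5 → 0.75 → 0.95) and the function names are illustrative assumptions for this card, not the exact rule from the paper; see [arXiv:2505.18129](https://arxiv.org/abs/2505.18129) for the actual mechanism.

```python
# Toy sketch of a dynamic IoU reward for box-prediction tasks.
# NOTE: the threshold schedule below is an illustrative assumption,
# not the exact schedule used in V-Triune.

def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def dynamic_iou_reward(pred_box, gt_box, progress):
    """Reward = IoU if it clears a threshold that tightens with
    training progress (progress in [0, 1]); otherwise 0."""
    if progress < 0.3:
        threshold = 0.5   # loose early on: rough localization still earns signal
    elif progress < 0.7:
        threshold = 0.75
    else:
        threshold = 0.95  # near-exact boxes required late in training
    iou = box_iou(pred_box, gt_box)
    return iou if iou >= threshold else 0.0

# Example: a decent but imperfect box (IoU ~0.82) is rewarded early, not late.
pred, gt = (10, 10, 50, 50), (12, 12, 52, 52)
print(dynamic_iou_reward(pred, gt, progress=0.1))  # ~0.82
print(dynamic_iou_reward(pred, gt, progress=0.9))  # 0.0
```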