|
--- |
|
base_model: |
|
- Qwen/Qwen2.5-VL-32B-Instruct |
|
datasets: |
|
- One-RL-to-See-Them-All/Orsta-Data-47k |
|
language: |
|
- en |
|
library_name: transformers |
|
license: mit |
|
pipeline_tag: image-text-to-text |
|
tags: |
|
- VLM |
|
- multimodal |
|
--- |
|
|
|
# One RL to See Them All: Visual Triple Unified Reinforcement Learning |
|
|
|
* **GitHub Repo:** [MiniMax-AI/One-RL-to-See-Them-All](https://github.com/MiniMax-AI/One-RL-to-See-Them-All)

* **Paper (arXiv):** [V-Triune: One RL to See Them All (arXiv:2505.18129)](https://arxiv.org/abs/2505.18129)

* **Dataset:** [Orsta-Data-47k on Hugging Face](https://huggingface.co/datasets/One-RL-to-See-Them-All/Orsta-Data-47k)
|
|
|
## Model Overview |
|
|
|
**Orsta-32B-0326** is a cutting-edge vision-language model (VLM) designed to achieve superior performance across a wide spectrum of both visual reasoning and visual perception tasks. It is the result of post-training with [**V-Triune**](https://github.com/MiniMax-AI/One-RL-to-See-Them-All), our novel unified reinforcement learning (RL) system.
|
|
|
The V-Triune system enables VLMs to be jointly optimized on diverse multimodal tasks within a single, cohesive training pipeline. Orsta-32B-0326 was trained with V-Triune on a carefully curated set of eight challenging visual tasks, fostering robust generalization and enhanced capabilities.
|
|
|
## Training with V-Triune |
|
|
|
Orsta-32B-0326's advanced abilities stem from its training with the V-Triune system. Key aspects of its training include: |
|
|
|
* **Unified RL Framework (V-Triune):** V-Triune is a Visual Triple-Unified Reinforcement Learning system featuring three core complementary components: |
|
|
|
* *Sample-Level Data Formatting* (to unify diverse task inputs) |
|
* *Verifier-Level Reward Computation* (to deliver custom rewards via specialized verifiers) |
|
* *Source-Level Metric Monitoring* (to diagnose problems at the data-source level) |
|
  * It also incorporates an innovative *Dynamic IoU reward* mechanism, crucial for optimizing visual perception tasks (an illustrative sketch follows at the end of this section). You can find more details in our paper: [V-Triune](https://arxiv.org/abs/2505.18129)
|
|
|
* **Diverse Joint Task Optimization:** Orsta-32B-0326 was jointly optimized on the following eight visual tasks: |
|
|
|
* *Visual Reasoning Tasks:* Mathematics, Science Question Answering, Chart Understanding, and Puzzle Solving. |
|
* *Visual Perception Tasks:* Object Detection, Visual Grounding, Optical Character Recognition (OCR), and Object Counting. |
|
|
|
This comprehensive training allows Orsta-32B-0326 to develop a deeper understanding of visual content and its relation to textual prompts, excelling in tasks that require intricate reasoning and precise perception. |
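
To make the *Dynamic IoU reward* idea concrete, here is a minimal Python sketch of a threshold-scheduled IoU reward for a detection task. The three-stage schedule (0.5 → 0.75 → 0.95) and all names below are illustrative assumptions, not the exact rule from the paper; please refer to the paper for the actual mechanism used in V-Triune.

```python
# Illustrative sketch of a dynamic IoU reward for object detection.
# The threshold schedule below is a hypothetical example, not V-Triune's exact rule.

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def dynamic_iou_reward(pred_box, gt_box, step, total_steps):
    """Binary reward whose IoU threshold tightens as training progresses."""
    progress = step / total_steps
    if progress < 1 / 3:
        threshold = 0.5    # loose early on: dense learning signal
    elif progress < 2 / 3:
        threshold = 0.75
    else:
        threshold = 0.95   # strict late: push toward precise localization
    return 1.0 if iou(pred_box, gt_box) >= threshold else 0.0

# The same loose prediction is rewarded early in training but not late:
print(dynamic_iou_reward([0, 0, 10, 10], [1, 1, 11, 11], step=10, total_steps=300))   # 1.0
print(dynamic_iou_reward([0, 0, 10, 10], [1, 1, 11, 11], step=290, total_steps=300))  # 0.0
```

The intuition is that a loose threshold early in training yields a dense reward signal, while progressively stricter thresholds drive predictions toward precise localization.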
|
|
|
## Performance |
|
| Model | Knowledge | Mathematics | Perception | Coding | Info. Extraction | Planning | Science | Metrics | MEGA-Bench<br>Core |
| :--------------------------------------------- | ----------: | ------------: | -----------: | -------: | ----------: | ---------: | --------: | --------: | ------------------: |
| Gemma3-27B | 49.43 | 42.20 | 45.46 | 40.18 | 49.30 | 24.96 | 47.08 | 58.99 | 41.82 |
| Qwen2.5-VL-32B-0326 | 46.09 | 32.04 | 47.55 | 38.36 | 61.65 | 28.43 | 37.55 | 50.38 | 43.67 |
| InternVL-3-38B | 46.32 | **40.29** | **55.05** | **45.29** | 56.63 | 22.88 | **52.04** | **58.04** | **46.69** |
| Skywork-R1V-38B | 25.59 | 28.45 | 22.95 | 19.88 | 19.53 | 9.74 | 22.64 | 37.55 | 21.54 |
| Skywork-R1V2-38B | 17.08 | 12.38 | 15.65 | 7.14 | 9.90 | 17.60 | 14.29 | 0.00 | 15.39 |
| **Orsta-32B-0326 (Ours)** | **46.78** | 37.43 | 50.86 | 38.92 | **63.14** | 28.05 | 42.68 | 53.01 | **45.78** |
| Δ (Ours - Backbone) | +0.7 | +5.4 | +3.3 | +0.6 | +1.5 | -0.4 | +5.1 | +2.6 | +2.1 |
|
|
|
## How to Use |
|
|
|
**Orsta-32B-0326** is developed by post-training the latest [**Qwen2.5-VL-32B-Instruct**](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct) model using our V-Triune reinforcement learning system. Consequently, its core usage, particularly regarding input formatting and model interaction, largely follows the established patterns of the Qwen2.5-VL series. |
|
|
|
For comprehensive details on the base model's capabilities, multi-turn dialogue format, image input encoding specifics, and other functionalities, we recommend referring to the official [Qwen2.5-VL documentation](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct). |
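
As a minimal starting point, the snippet below follows the standard Qwen2.5-VL inference pattern with the `transformers` library and the optional `qwen-vl-utils` helper package. The repository ID and the image path are placeholders that we assume match this card; adjust them to the checkpoint and inputs you actually use.

```python
# Minimal inference sketch following the standard Qwen2.5-VL usage pattern.
# Assumes `pip install transformers qwen-vl-utils` and a recent transformers release.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "One-RL-to-See-Them-All/Orsta-32B-0326"  # assumed repository ID
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/or/url/to/your_image.jpg"},  # placeholder
            {"type": "text", "text": "How many objects are in this image?"},
        ],
    }
]

# Build the chat-formatted prompt and gather the vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate, then strip the prompt tokens before decoding.
generated = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```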
|
|
|
## Citation
|
If you use Orsta-32B-0326 or the V-Triune system in your research, please cite our work: |
|
```bibtex
@article{ma2025one,
  title={One RL to See Them All: Visual Triple Unified Reinforcement Learning},
  author={Ma, Yan and Du, Linge and Shen, Xuyang and Chen, Shaoxiang and Li, Pengfei and Ren, Qibing and Ma, Lizhuang and Dai, Yuchao and Liu, Pengfei and Yan, Junjie},
  journal={arXiv preprint arXiv:2505.18129},
  year={2025}
}
```
|
|
|
## Project Page |
|
[MiniMax-AI/One-RL-to-See-Them-All](https://github.com/MiniMax-AI/One-RL-to-See-Them-All)