---
base_model:
- Qwen/Qwen2.5-VL-32B-Instruct
datasets:
- One-RL-to-See-Them-All/Orsta-Data-47k
language:
- en
library_name: transformers
license: mit
pipeline_tag: image-text-to-text
tags:
- VLM
- multimodal
---

# One RL to See Them All: Visual Triple Unified Reinforcement Learning

* πŸ™ **GitHub Repo:** [MiniMax-AI/One-RL-to-See-Them-All](https://github.com/MiniMax-AI/One-RL-to-See-Them-All)
* πŸ“œ **Paper (arXiv):** [V-Triune: One RL to See Them All (arXiv:2505.18129)](https://arxiv.org/abs/2505.18129)
* πŸ’Ύ **Dataset:** [Orsta-Data-47k on Hugging Face](https://huggingface.co/datasets/One-RL-to-See-Them-All/Orsta-Data-47k)

## Model Overview

**Orsta-32B-0326** is a cutting-edge vision-language model (VLM) designed to achieve superior performance across a wide spectrum of both visual reasoning and visual perception tasks. This model is the result of post-training with [**V-Triune**](https://github.com/MiniMax-AI/One-RL-to-See-Them-All), our novel unified reinforcement learning (RL) system.

The V-Triune system enables VLMs to be jointly optimized on diverse multimodal tasks within a single, cohesive training pipeline. Orsta-32B-0326 was trained using V-Triune on a carefully curated set of eight challenging visual tasks, fostering robust generalization and enhanced capabilities.

## Training with V-Triune

Orsta-32B-0326's advanced abilities stem from its training with the V-Triune system. Key aspects of its training include:

* **Unified RL Framework (V-Triune):** V-Triune is a Visual Triple-Unified Reinforcement Learning system featuring three core complementary components:

  * *Sample-Level Data Formatting* (to unify diverse task inputs)
  * *Verifier-Level Reward Computation* (to deliver custom rewards via specialized verifiers)
  * *Source-Level Metric Monitoring* (to diagnose problems at the data-source level)
  * It also incorporates an innovative *Dynamic IoU reward* mechanism, crucial for optimizing visual perception tasks; a toy sketch of this idea appears at the end of this section. You can find more details in our paper: [V-Triune](https://arxiv.org/abs/2505.18129).

* **Diverse Joint Task Optimization:** Orsta-32B-0326 was jointly optimized on the following eight visual tasks:

  * *Visual Reasoning Tasks:* Mathematics, Science Question Answering, Chart Understanding, and Puzzle Solving.
  * *Visual Perception Tasks:* Object Detection, Visual Grounding, Optical Character Recognition (OCR), and Object Counting.

This comprehensive training allows Orsta-32B-0326 to develop a deeper understanding of visual content and its relation to textual prompts, excelling in tasks that require intricate reasoning and precise perception.
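
As a rough illustration of the *Dynamic IoU reward* idea referenced above, the sketch below grants a binary reward only when a predicted bounding box clears an IoU threshold that tightens as training progresses. The specific thresholds (0.5 → 0.75 → 0.99) and the equal-thirds schedule are hypothetical placeholders for illustration; see the paper for the exact rule used in V-Triune.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def dynamic_iou_reward(pred_box, gt_box, step, total_steps):
    """Binary reward whose IoU threshold tightens over training.

    In V-Triune terms, this would be one task-specific verifier (for
    detection-style tasks). The 0.5 -> 0.75 -> 0.99 milestones below
    are illustrative placeholders, not the paper's exact schedule.
    """
    progress = step / max(total_steps, 1)
    if progress < 1 / 3:
        threshold = 0.5
    elif progress < 2 / 3:
        threshold = 0.75
    else:
        threshold = 0.99
    return 1.0 if iou(pred_box, gt_box) >= threshold else 0.0
```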

## Performance
| Model                                          | Knowledge   | Mathematics   | Perception   | Coding   | Info. Ex.   | Planning   | Science   | Metrics   | MEGA-Bench<br>Core   |
| :--------------------------------------------- | ----------: | ------------: | -----------: | -------: | ----------: | ---------: | --------: | --------: | ------------------: |
| Gemma3-27B                                     | 49.43       | 42.20         | 45.46        | 40.18    | 49.30       | 24.96      | 47.08     | 58.99     | 41.82 †             |
| QwenVL-2.5-32B-0326                            | 46.09       | 32.04         | 47.55        | 38.36    | 61.65       | 28.43      | 37.55     | 50.38     | 43.67               |
| InternVL-3-38B                                 | 46.32       | **40.29** | **55.05** | **45.29**| 56.63       | 22.88      | **52.04** | **58.04** | **46.69** |
| Skywork-R1V-38B 💡                             | 25.59       | 28.45         | 22.95        | 19.88    | 19.53       | 9.74       | 22.64     | 37.55     | 21.54               |
| Skywork-R1V2-38B 💡                            | 17.08       | 12.38         | 15.65        | 7.14     | 9.90        | 17.60      | 14.29     | 0.0       | 15.39               |
| **Orsta-32B-0326 (Ours) 💡** | **46.78** | 37.43         | 50.86        | 38.92    | **63.14** | 28.05      | 42.68     | 53.01     | **45.78** |
| - | - | - | - | - | - | - | - | - | - |
| Δ (Ours - Backbone)                            | +0.7        | +5.4          | +3.3         | +0.6     | +1.5        | -0.4       | +5.1      | +2.6      | +2.1                |

## How to Use

**Orsta-32B-0326** was developed by post-training the latest [**Qwen2.5-VL-32B-Instruct**](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct) model using our V-Triune reinforcement learning system. Consequently, its core usage, particularly regarding input formatting and model interaction, largely follows the established patterns of the Qwen2.5-VL series.

For comprehensive details on the base model's capabilities, multi-turn dialogue format, image input encoding specifics, and other functionalities, we recommend referring to the official [Qwen2.5-VL documentation](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct).
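
Because the model keeps the Qwen2.5-VL interface, inference with Hugging Face Transformers should look essentially like the standard Qwen2.5-VL recipe sketched below (requires a recent `transformers` and `pip install qwen-vl-utils`). The repo ID and the image URL in this sketch are illustrative placeholders; adjust them to the actual hosted paths.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Repo ID assumed from the model name; adjust if the hosted path differs.
model_id = "One-RL-to-See-Them-All/Orsta-32B-0326"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpg"},  # placeholder image
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Standard Qwen2.5-VL preprocessing: chat template plus vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate, then strip the prompt tokens before decoding.
generated_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```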

## Citation 🏆
If you use Orsta-32B-0326 or the V-Triune system in your research, please cite our work:
```bibtex
@article{ma2025one,
      title={One RL to See Them All: Visual Triple Unified Reinforcement Learning}, 
      author={Ma, Yan and Du, Linge and Shen, Xuyang and Chen, Shaoxiang and Li, Pengfei and Ren, Qibing and Ma, Lizhuang and Dai, Yuchao and Liu, Pengfei and Yan, Junjie},
      journal={arXiv preprint arXiv:2505.18129},
      year={2025}
}
```

## Project Page
https://github.com/MiniMax-AI/One-RL-to-See-Them-All