|
--- |
|
base_model: |
|
- Qwen/Qwen2.5-VL-32B-Instruct |
|
datasets: |
|
- One-RL-to-See-Them-All/Orsta-Data-47k |
|
language: |
|
- en |
|
library_name: transformers |
|
license: mit |
|
pipeline_tag: image-text-to-text |
|
tags: |
|
- VLM |
|
- multimodal |
|
--- |
|
|
|
# One RL to See Them All: Visual Triple Unified Reinforcement Learning |
|
|
|
* **GitHub Repo:** [MiniMax-AI/One-RL-to-See-Them-All](https://github.com/MiniMax-AI/One-RL-to-See-Them-All)

* **Paper (arXiv):** [V-Triune: One RL to See Them All (arXiv:2505.18129)](https://arxiv.org/abs/2505.18129)

* **Dataset:** [Orsta-Data-47k on Hugging Face](https://huggingface.co/datasets/One-RL-to-See-Them-All/Orsta-Data-47k)
|
|
|
## Model Overview |
|
|
|
**Orsta-32B-0326** is a cutting-edge vision-language model (VLM) designed to achieve superior performance across a wide spectrum of both visual reasoning and visual perception tasks. It is the result of post-training with [**V-Triune**](https://github.com/MiniMax-AI/One-RL-to-See-Them-All), our novel unified reinforcement learning (RL) system.
|
|
|
The V-Triune system enables VLMs to be jointly optimized on diverse multimodal tasks within a single, cohesive training pipeline. Orsta-32B-0326 was trained with V-Triune on a carefully curated set of eight challenging visual tasks, fostering robust generalization and enhanced capabilities.
|
|
|
## Training with V-Triune |
|
|
|
Orsta-32B-0326's advanced abilities stem from its training with the V-Triune system. Key aspects of its training include: |
|
|
|
* **Unified RL Framework (V-Triune):** V-Triune is a Visual Triple-Unified Reinforcement Learning system featuring three core complementary components: |
|
|
|
* *Sample-Level Data Formatting* (to unify diverse task inputs) |
|
* *Verifier-Level Reward Computation* (to deliver custom rewards via specialized verifiers) |
|
* *Source-Level Metric Monitoring* (to diagnose problems at the data-source level) |
|
  * It also incorporates an innovative *Dynamic IoU reward* mechanism, crucial for optimizing visual perception tasks (an illustrative sketch follows at the end of this section). You can find more details in our paper: [V-Triune](https://arxiv.org/abs/2505.18129)
|
|
|
* **Diverse Joint Task Optimization:** Orsta-32B-0326 was jointly optimized on the following eight visual tasks: |
|
|
|
* *Visual Reasoning Tasks:* Mathematics, Science Question Answering, Chart Understanding, and Puzzle Solving. |
|
* *Visual Perception Tasks:* Object Detection, Visual Grounding, Optical Character Recognition (OCR), and Object Counting. |
|
|
|
This comprehensive training allows Orsta-32B-0326 to develop a deeper understanding of visual content and its relation to textual prompts, excelling in tasks that require intricate reasoning and precise perception. |
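
To make the *Dynamic IoU reward* idea concrete, here is a minimal Python sketch of a threshold-scheduled IoU reward for a detection task. The three-stage schedule (0.5 → 0.75 → 0.95) and all names below are illustrative assumptions, not the exact rule from the paper; please refer to the paper for the actual mechanism used in V-Triune.

```python
# Illustrative sketch of a dynamic IoU reward for object detection.
# The threshold schedule below is a hypothetical example, not V-Triune's exact rule.

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def dynamic_iou_reward(pred_box, gt_box, step, total_steps):
    """Binary reward whose IoU threshold tightens as training progresses."""
    progress = step / total_steps
    if progress < 1 / 3:
        threshold = 0.5    # loose early on: dense learning signal
    elif progress < 2 / 3:
        threshold = 0.75
    else:
        threshold = 0.95   # strict late: push toward precise localization
    return 1.0 if iou(pred_box, gt_box) >= threshold else 0.0

# The same loose prediction is rewarded early in training but not late:
print(dynamic_iou_reward([0, 0, 10, 10], [1, 1, 11, 11], step=10, total_steps=300))   # 1.0
print(dynamic_iou_reward([0, 0, 10, 10], [1, 1, 11, 11], step=290, total_steps=300))  # 0.0
```

The intuition is that a loose threshold early in training yields a dense reward signal, while progressively stricter thresholds drive predictions toward precise localization.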
|
|
|
## Performance |
|
| Model | Knowledge | Mathematics | Perception | Coding | Info. Extraction | Planning | Science | Metrics | MEGA-Bench<br>Core |
| :--------------------------------------------- | ----------: | ------------: | -----------: | -------: | ----------: | ---------: | --------: | --------: | ------------------: |
| Gemma3-27B | 49.43 | 42.20 | 45.46 | 40.18 | 49.30 | 24.96 | 47.08 | 58.99 | 41.82 |
| Qwen2.5-VL-32B-0326 | 46.09 | 32.04 | 47.55 | 38.36 | 61.65 | 28.43 | 37.55 | 50.38 | 43.67 |
| InternVL-3-38B | 46.32 | **40.29** | **55.05** | **45.29** | 56.63 | 22.88 | **52.04** | **58.04** | **46.69** |
| Skywork-R1V-38B | 25.59 | 28.45 | 22.95 | 19.88 | 19.53 | 9.74 | 22.64 | 37.55 | 21.54 |
| Skywork-R1V2-38B | 17.08 | 12.38 | 15.65 | 7.14 | 9.90 | 17.60 | 14.29 | 0.00 | 15.39 |
| **Orsta-32B-0326 (Ours)** | **46.78** | 37.43 | 50.86 | 38.92 | **63.14** | 28.05 | 42.68 | 53.01 | **45.78** |
| Δ (Ours - Backbone) | +0.7 | +5.4 | +3.3 | +0.6 | +1.5 | -0.4 | +5.1 | +2.6 | +2.1 |
|
|
|
## How to Use |
|
|
|
**Orsta-32B-0326** is developed by post-training the latest [**Qwen2.5-VL-32B-Instruct**](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct) model using our V-Triune reinforcement learning system. Consequently, its core usage, particularly regarding input formatting and model interaction, largely follows the established patterns of the Qwen2.5-VL series. |
|
|
|
For comprehensive details on the base model's capabilities, multi-turn dialogue format, image input encoding specifics, and other functionalities, we recommend referring to the official [Qwen2.5-VL documentation](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct). |
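
As a minimal starting point, the snippet below follows the standard Qwen2.5-VL inference pattern with the `transformers` library and the optional `qwen-vl-utils` helper package. The repository ID and the image path are placeholders that we assume match this card; adjust them to the checkpoint and inputs you actually use.

```python
# Minimal inference sketch following the standard Qwen2.5-VL usage pattern.
# Assumes `pip install transformers qwen-vl-utils` and a recent transformers release.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "One-RL-to-See-Them-All/Orsta-32B-0326"  # assumed repository ID
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/or/url/to/your_image.jpg"},  # placeholder
            {"type": "text", "text": "How many objects are in this image?"},
        ],
    }
]

# Build the chat-formatted prompt and gather the vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate, then strip the prompt tokens before decoding.
generated = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```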
|
|
|
## Citation
|
If you use Orsta-32B-0326 or the V-Triune system in your research, please cite our work: |
|
```bibtex
@article{ma2025one,
  title={One RL to See Them All: Visual Triple Unified Reinforcement Learning},
  author={Ma, Yan and Du, Linge and Shen, Xuyang and Chen, Shaoxiang and Li, Pengfei and Ren, Qibing and Ma, Lizhuang and Dai, Yuchao and Liu, Pengfei and Yan, Junjie},
  journal={arXiv preprint arXiv:2505.18129},
  year={2025}
}
```
|
|
|
## Project Page |
|
[MiniMax-AI/One-RL-to-See-Them-All](https://github.com/MiniMax-AI/One-RL-to-See-Them-All)