|
--- |
|
license: mit |
|
tags: |
|
- RLinf |
|
language: |
|
- en |
|
metrics: |
|
- accuracy |
|
base_model: |
|
- Haozhan72/Openvla-oft-SFT-libero10-traj1 |
|
pipeline_tag: reinforcement-learning |
|
model-index: |
|
- name: RLinf-OpenVLAOFT-GRPO-LIBERO-10 |
|
results: |
|
- task: |
|
type: VLA |
|
dataset: |
|
type: libero_10 |
|
name: libero_10 |
|
metrics: |
|
- type: accuracy |
|
value: 94.35 |
|
--- |
|
|
|
<div align="center"> |
|
<img src="logo.svg" alt="RLinf-logo" width="500"/> |
|
</div> |
|
|
|
|
|
<div align="center"> |
|
<!-- <a href="TODO"><img src="https://img.shields.io/badge/arXiv-Paper-red?logo=arxiv"></a> --> |
|
<!-- <a href="TODO"><img src="https://img.shields.io/badge/HuggingFace-yellow?logo=huggingface&logoColor=white" alt="Hugging Face"></a> --> |
|
<a href="https://github.com/RLinf/RLinf"><img src="https://img.shields.io/badge/Github-blue"></a> |
|
<a href="https://rlinf.readthedocs.io/en/latest/"><img src="https://img.shields.io/badge/Documentation-Purple?color=8A2BE2&logo=readthedocs"></a> |
|
<!-- <a href="TODO"><img src="https://devin.ai/assets/deepwiki-badge.png" alt="Ask DeepWiki.com" style="height:20px;"></a> |
|
<a href="TODO"><img src="https://img.shields.io/badge/微信-green?logo=wechat&"></a> --> |
|
</div> |
|
|
|
<h1 align="center">RLinf: Reinforcement Learning Infrastructure for Agentic AI</h1> |
|
|
|
[RLinf](https://github.com/RLinf/RLinf) is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development. |
|
|
|
|
|
<div align="center"> |
|
<img src="overview.png" alt="RLinf-overview" width="600"/> |
|
</div> |
|
|
|
## Model Description |
|
The RLinf-openvlaoft-libero series is trained from the Haozhan72/Openvla-oft-SFT-libero-xxx-traj1 checkpoints (covering libero-10, libero-object, libero-goal, and libero-spatial), using the same base models and training datasets as verl. Training with RLinf yields state-of-the-art (SOTA) performance.
|
|
|
We use a mask to restrict the loss to valid action tokens and compute a token-level loss based on the Group Relative Policy Optimization (GRPO) advantage function, improving the model's performance on spatial reasoning, object generalization, instruction generalization, and long-horizon tasks.
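
The sketch below illustrates this masked, token-level GRPO objective. It is a minimal example only; the tensor layout, function names, and clipping range are assumptions for illustration, not the exact RLinf implementation.

```python
import torch

def grpo_token_loss(logprobs, old_logprobs, advantages, action_mask, clip_eps=0.2):
    """PPO-style clipped surrogate loss at the token level with GRPO advantages.

    logprobs, old_logprobs: [batch, seq_len] log-probs of the sampled action tokens
    advantages:             [batch] group-relative advantage per trajectory
    action_mask:            [batch, seq_len] 1 for valid action tokens, 0 elsewhere
    """
    ratio = torch.exp(logprobs - old_logprobs)            # importance ratio per token
    adv = advantages.unsqueeze(-1)                        # broadcast advantage to tokens
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    # Average only over valid action tokens; padding and prompt tokens are masked out.
    return -(surrogate * action_mask).sum() / action_mask.sum().clamp(min=1)

def grpo_advantages(rewards, group_size):
    """Group-relative advantages: normalize each reward within its sampling group."""
    r = rewards.view(-1, group_size)
    return ((r - r.mean(dim=1, keepdim=True)) / (r.std(dim=1, keepdim=True) + 1e-8)).view(-1)
```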
|
|
|
|
|
## Evaluation and Results |
|
We trained and evaluated four models using RLinf: |
|
|
|
- RLinf-openvlaoft-libero-object Model (based on [Haozhan72/Openvla-oft-SFT-libero-object-traj1](https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero-object-traj1)) |
|
- Recommended sampling settings: `temperature = 1.6`, `top_p = 1.0` |
|
|
|
- RLinf-openvlaoft-libero-spatial Model (based on [Haozhan72/Openvla-oft-SFT-libero-spatial-traj1](https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero-spatial-traj1)) |
|
- Recommended sampling settings: `temperature = 1.6`, `top_p = 1.0` |
|
|
|
- RLinf-openvlaoft-libero-goal Model (based on [Haozhan72/Openvla-oft-SFT-libero-goal-traj1](https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero-goal-traj1))
|
- Recommended sampling settings: `temperature = 1.6`, `top_p = 1.0` |
|
|
|
- RLinf-openvlaoft-libero10 Model (based on [Haozhan72/Openvla-oft-SFT-libero10-traj1](https://huggingface.co/Haozhan72/Openvla-oft-SFT-libero10-traj1))
|
- Recommended sampling settings: `temperature = 1.6`, `top_p = 1.0` |
|
|
|
### Benchmark Results |
|
|
|
All SFT models are from [SimpleVLA-RL](https://huggingface.co/collections/Haozhan72/simplevla-rl-6833311430cd9df52aeb1f86).
|
- Recommended sampling settings for evaluation: `libero seed=0`; `episode number=500`; `do_sample=False`
|
|
|
| Model              | Object | Spatial | Goal  | Long  | Average |
| ------------------ | ------ | ------- | ----- | ----- | ------- |
| SFT models         | 25.60  | 56.45   | 45.59 | 9.68  | 34.33   |
| Trained with RLinf | 98.99  | 98.99   | 98.99 | 94.35 | 97.83   |
|
|
|
<div align="center"> |
|
<img src="tensorboard-success_once.png" alt="RLinf-libero-result" width="600"/> |
|
</div> |
|
|
|
## How to Use |
|
To use the model, run it with the [RLinf](https://github.com/RLinf/RLinf) codebase. In the configuration file `examples/embodiment/config/libero_10_grpo_openvlaoft.yaml`, modify the following parameters:
|
|
|
- Set `actor.checkpoint_load_path`, `actor.tokenizer.tokenizer_model`, and `rollout.model_dir` to the path of the model checkpoint.
|
|
|
Note: If you intend to evaluate the model directly, make sure to set `actor.model.is_lora` to `false`.
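
A minimal sketch of applying these edits programmatically is shown below, assuming the YAML nesting matches the dotted key names above; the checkpoint path is a placeholder for your local download.

```python
# Minimal sketch: point the config at a local checkpoint (the path is a placeholder,
# and the YAML nesting is assumed to mirror the dotted key names listed above).
import yaml

CONFIG_PATH = "examples/embodiment/config/libero_10_grpo_openvlaoft.yaml"
CHECKPOINT_DIR = "/path/to/RLinf-OpenVLAOFT-GRPO-LIBERO-10"  # downloaded model checkpoint

with open(CONFIG_PATH) as f:
    cfg = yaml.safe_load(f)

# Point the actor and rollout workers at the downloaded checkpoint.
cfg["actor"]["checkpoint_load_path"] = CHECKPOINT_DIR
cfg["actor"]["tokenizer"]["tokenizer_model"] = CHECKPOINT_DIR
cfg["rollout"]["model_dir"] = CHECKPOINT_DIR

# For direct evaluation of the released weights, disable LoRA loading.
cfg["actor"]["model"]["is_lora"] = False

with open(CONFIG_PATH, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```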
|
|
|
## License |
|
This code repository and the model weights are licensed under the MIT License. |
|
|