# RLinf: Reinforcement Learning Infrastructure for Agentic AI
RLinf is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development.

## Model Description

This openvla-oft model starts from Haozhan72/Openvla-oft-SFT-libero10-trajall with an additional LoRA SFT checkpoint, and is then fine-tuned with Group Relative Policy Optimization (GRPO) on the ManiSkill simulator.
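For context, GRPO forgoes a learned value critic and instead computes group-relative advantages: for each task instance, a group of $G$ rollouts is sampled and each rollout's reward is normalized against the group statistics. This is the standard GRPO formulation; the exact variant used here is determined by the RLinf training configuration.

$$
A_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}
$$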
## Full OOD Evaluation and Results

### Overall OOD Eval Results

Note: rl4vla refers to the paper *VLA-RL-Study: What Can RL Bring to VLA Generalization? An Empirical Study*.

Setting | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
---|---|---|---|---|---|
Avg results | 76.08 | 61.48 | 64.53 | 82.21 | 75.47 |
### OOD Eval on Vision

Setting | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
---|---|---|---|---|---|
vision avg | 76.56 | 84.69 | 80.55 | 82.03 | 74.69 |
unseen table | 84.40 | 91.41 | 94.53 | 95.70 | 89.84 |
dynamic texture (weak) | 83.30 | 91.02 | 82.42 | 85.55 | 78.91 |
dynamic texture (strong) | 63.00 | 77.34 | 62.50 | 72.27 | 65.62 |
dynamic noise (weak) | 85.40 | 89.45 | 89.84 | 87.11 | 79.69 |
dynamic noise (strong) | 66.70 | 74.22 | 73.44 | 69.53 | 59.38 |
### OOD Eval on Semantic

Setting | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
---|---|---|---|---|---|
object avg | 75.40 | 51.61 | 56.64 | 80.57 | 74.41 |
train setting | 93.80 | 94.14 | 91.80 | 96.09 | 84.38 |
unseen objects | 71.40 | 80.47 | 77.73 | 81.64 | 76.56 |
unseen receptacles | 75.00 | 74.22 | 78.12 | 81.25 | 73.44 |
unseen instructions | 89.10 | 67.97 | 68.36 | 94.53 | 89.06 |
multi-object (both seen) | 75.00 | 35.16 | 42.97 | 84.38 | 75.78 |
multi-object (both unseen) | 57.80 | 30.47 | 38.67 | 62.89 | 57.81 |
distractive receptacle | 81.20 | 18.75 | 31.64 | 82.81 | 78.12 |
multi-receptacle (both unseen) | 59.90 | 11.72 | 23.83 | 60.94 | 60.16 |
### OOD Eval on Position

Setting | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
---|---|---|---|---|---|
position avg | 77.60 | 42.97 | 56.05 | 89.26 | 81.64 |
unseen position (object & receptacle) | 80.70 | 40.23 | 50.39 | 86.33 | 75.00 |
mid-episode object reposition | 74.50 | 45.70 | 61.72 | 92.19 | 88.28 |
## How to Use

Please integrate the provided model with the RLinf codebase. To do so, modify the following parameters in the configuration file `examples/embodiment/config/maniskill_grpo_openvlaoft.yaml` (a sketch of the edited fields follows below):
- Set `actor.checkpoint_load_path`, `actor.tokenizer.tokenizer_model`, and `rollout.model_dir` to the path of the model checkpoint.

Note: If you intend to evaluate the model directly, make sure to set `actor.model.is_lora` to `false`.
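Below is a minimal sketch of how these fields might look after editing. The YAML nesting is inferred from the dotted parameter names above, and the checkpoint path is a placeholder; consult the actual file for its full structure.

```yaml
# Sketch only: nesting inferred from the dotted parameter names above; the path
# is a placeholder for your local copy of this model checkpoint.
actor:
  checkpoint_load_path: /path/to/model/checkpoint
  tokenizer:
    tokenizer_model: /path/to/model/checkpoint
  model:
    is_lora: false   # set to false when evaluating the model directly
rollout:
  model_dir: /path/to/model/checkpoint
```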
## License

This code repository and the model weights are licensed under the MIT License.
## Evaluation Results

Self-reported accuracy:
- maniskill-vision: 84.600
- maniskill-semantic: 51.600
- maniskill-position: 42.900