RLinf: Reinforcement Learning Infrastructure for Agentic AI
RLinf is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development.

Model Description
This OpenVLA-OFT model is initialized from Haozhan72/Openvla-oft-SFT-libero10-trajall together with an additional LoRA SFT checkpoint, and is then fine-tuned with Proximal Policy Optimization (PPO) in the ManiSkill simulator.
Full OOD Evaluation and Results
Overall OOD Eval Results
Note: rl4vla refers to the paper VLA-RL-Study: What Can RL Bring to VLA Generalization? An Empirical Study.
Setting | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
---|---|---|---|---|---|
Avg results | 0.7608 | 0.61484375 | 0.6453125 | 0.822135417 | 0.7546875 |
OOD Eval on Vision
Setting | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
---|---|---|---|---|---|
vision avg | 0.7656 | 0.846875 | 0.80546875 | 0.8203125 | 0.746875 |
unseen table | 0.844 | 0.9140625 | 0.9453125 | 0.95703125 | 0.8984375 |
dynamic texture (weak) | 0.833 | 0.91015625 | 0.82421875 | 0.85546875 | 0.7890625 |
dynamic texture (strong) | 0.63 | 0.7734375 | 0.625 | 0.72265625 | 0.65625 |
dynamic noise (weak) | 0.854 | 0.89453125 | 0.8984375 | 0.87109375 | 0.796875 |
dynamic noise (strong) | 0.667 | 0.7421875 | 0.734375 | 0.6953125 | 0.59375 |
OOD Eval on Semantic
Setting | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
---|---|---|---|---|---|
object avg | 0.754 | 0.516113281 | 0.56640625 | 0.805664063 | 0.744140625 |
train setting | 0.938 | 0.94140625 | 0.91796875 | 0.9609375 | 0.84375 |
unseen objects | 0.714 | 0.8046875 | 0.77734375 | 0.81640625 | 0.765625 |
unseen receptacles | 0.75 | 0.7421875 | 0.78125 | 0.8125 | 0.734375 |
unseen instructions | 0.891 | 0.6796875 | 0.68359375 | 0.9453125 | 0.890625 |
multi-object (both seen) | 0.75 | 0.3515625 | 0.4296875 | 0.84375 | 0.7578125 |
multi-object (both unseen) | 0.578 | 0.3046875 | 0.38671875 | 0.62890625 | 0.578125 |
distractive receptacle | 0.812 | 0.1875 | 0.31640625 | 0.828125 | 0.78125 |
multi-receptacle (both unseen) | 0.599 | 0.1171875 | 0.23828125 | 0.609375 | 0.6015625 |
OOD Eval on Position
Setting | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
---|---|---|---|---|---|
position avg | 0.776 | 0.4296875 | 0.560546875 | 0.892578125 | 0.81640625 |
unseen position (object & receptacle) | 0.807 | 0.40234375 | 0.50390625 | 0.86328125 | 0.75 |
mid-episode object reposition | 0.745 | 0.45703125 | 0.6171875 | 0.921875 | 0.8828125 |
How to Use
Please integrate the provided model with the RLinf codebase. To do so, modify the following parameters in the configuration file `examples/embodiment/config/maniskill_ppo_openvlaoft.yaml`:
- Set `actor.checkpoint_load_path`, `actor.tokenizer.tokenizer_model`, and `rollout.model_dir` to the path of the model checkpoint.

Note: If you intend to evaluate the model directly, make sure to set `actor.model.is_lora` to `false`.
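For reference, below is a minimal sketch of how these fields might look in `maniskill_ppo_openvlaoft.yaml`. The nesting is inferred from the dotted parameter names above, the actual file contains many more settings, and `/path/to/this/checkpoint` is a placeholder for the local path of the downloaded model.

```yaml
# Illustrative excerpt only: nesting is inferred from the dotted parameter names above;
# the real RLinf config file contains additional fields.
actor:
  checkpoint_load_path: /path/to/this/checkpoint
  tokenizer:
    tokenizer_model: /path/to/this/checkpoint
  model:
    is_lora: false   # set to false when evaluating this released checkpoint directly
rollout:
  model_dir: /path/to/this/checkpoint
```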
License
This code repository and the model weights are licensed under the MIT License.
Evaluation results
- accuracy on maniskill-vision (self-reported): 80.5
- accuracy on maniskill-semantic (self-reported): 56.6
- accuracy on maniskill-position (self-reported): 56.1