# RLinf: Reinforcement Learning Infrastructure for Agentic AI

RLinf is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development.

*(Figure: RLinf overview)*

## Model Description

This OpenVLA-OFT model starts from `Haozhan72/Openvla-oft-SFT-libero10-trajall`, an additional LoRA SFT checkpoint, and is fine-tuned with Proximal Policy Optimization (PPO) on the ManiSkill simulator. The released weights total 7.54B parameters, stored in BF16 Safetensors format.

## Full OOD Evaluation and Results

### Overall OOD Eval Results

Note: *rl4vla* refers to the paper *VLA-RL-Study: What Can RL Bring to VLA Generalization? An Empirical Study*.

| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
| --- | --- | --- | --- | --- | --- |
| Avg results | 0.7608 | 0.61484375 | 0.6453125 | 0.822135417 | 0.7546875 |

### OOD Eval on Vision

| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
| --- | --- | --- | --- | --- | --- |
| vision avg | 0.7656 | 0.846875 | 0.80546875 | 0.8203125 | 0.746875 |
| unseen table | 0.844 | 0.9140625 | 0.9453125 | 0.95703125 | 0.8984375 |
| dynamic texture (weak) | 0.833 | 0.91015625 | 0.82421875 | 0.85546875 | 0.7890625 |
| dynamic texture (strong) | 0.63 | 0.7734375 | 0.625 | 0.72265625 | 0.65625 |
| dynamic noise (weak) | 0.854 | 0.89453125 | 0.8984375 | 0.87109375 | 0.796875 |
| dynamic noise (strong) | 0.667 | 0.7421875 | 0.734375 | 0.6953125 | 0.59375 |

### OOD Eval on Semantic

| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
| --- | --- | --- | --- | --- | --- |
| object avg | 0.754 | 0.516113281 | 0.56640625 | 0.805664063 | 0.744140625 |
| train setting | 0.938 | 0.94140625 | 0.91796875 | 0.9609375 | 0.84375 |
| unseen objects | 0.714 | 0.8046875 | 0.77734375 | 0.81640625 | 0.765625 |
| unseen receptacles | 0.75 | 0.7421875 | 0.78125 | 0.8125 | 0.734375 |
| unseen instructions | 0.891 | 0.6796875 | 0.68359375 | 0.9453125 | 0.890625 |
| multi-object (both seen) | 0.75 | 0.3515625 | 0.4296875 | 0.84375 | 0.7578125 |
| multi-object (both unseen) | 0.578 | 0.3046875 | 0.38671875 | 0.62890625 | 0.578125 |
| distractive receptacle | 0.812 | 0.1875 | 0.31640625 | 0.828125 | 0.78125 |
| multi-receptacle (both unseen) | 0.599 | 0.1171875 | 0.23828125 | 0.609375 | 0.6015625 |

### OOD Eval on Position

| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
| --- | --- | --- | --- | --- | --- |
| position avg | 0.776 | 0.4296875 | 0.560546875 | 0.892578125 | 0.81640625 |
| unseen position (object & receptacle) | 0.807 | 0.40234375 | 0.50390625 | 0.86328125 | 0.75 |
| mid-episode object reposition | 0.745 | 0.45703125 | 0.6171875 | 0.921875 | 0.8828125 |

## How to Use

To use this model, integrate it with the RLinf codebase and modify the following parameters in the configuration file `examples/embodiment/config/maniskill_ppo_openvlaoft.yaml`:

- Set `actor.checkpoint_load_path`, `actor.tokenizer.tokenizer_model`, and `rollout.model_dir` to the path of the model checkpoint.

Note: If you intend to evaluate the model directly, make sure to set `actor.model.is_lora` to `false`, as shown in the sketch below.
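For concreteness, the relevant fields might look like the following. This is a minimal sketch, assuming the rest of `maniskill_ppo_openvlaoft.yaml` is left unchanged; `/path/to/model-checkpoint` is a hypothetical placeholder for wherever you downloaded the weights, and the nesting is inferred from the dotted parameter names above.

```yaml
# Sketch of the fields to edit in
# examples/embodiment/config/maniskill_ppo_openvlaoft.yaml
# (all other keys in the file are omitted here).
actor:
  checkpoint_load_path: /path/to/model-checkpoint  # hypothetical placeholder path
  tokenizer:
    tokenizer_model: /path/to/model-checkpoint     # same checkpoint path
  model:
    is_lora: false  # set to false when evaluating this released model directly

rollout:
  model_dir: /path/to/model-checkpoint             # same checkpoint path
```

All three path fields should point at the same downloaded checkpoint directory.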

## License

This code repository and the model weights are licensed under the MIT License.
