
RLinf: Reinforcement Learning Infrastructure for Agentic AI

RLinf is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning. The "inf" in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system's support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development.


Model Description

This OpenVLA-OFT model starts from Haozhan72/Openvla-oft-SFT-libero10-trajall with an additional LoRA SFT checkpoint and is fine-tuned with Group Relative Policy Optimization (GRPO) in the ManiSkill simulator.
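For reference, GRPO dispenses with PPO's learned value baseline and instead normalizes rewards within a group of rollouts for the same task. A sketch of the standard group-relative advantage (the exact reward shaping and objective used here are defined by the RLinf codebase):

```latex
% For a group of G rollouts with scalar rewards r_1, \dots, r_G,
% the group-relative advantage of rollout i is
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}
```

This advantage then replaces the critic-based advantage inside a PPO-style clipped policy objective.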

Full OOD Evaluation and Results

Overall OOD Eval Results

Note: rl4vla refers to the paper VLA-RL-Study: What Can RL Bring to VLA Generalization? An Empirical Study.

| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
| --- | --- | --- | --- | --- | --- |
| Avg results | 76.08 | 61.48 | 64.53 | 82.21 | 75.47 |

OOD Eval on Vision

| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
| --- | --- | --- | --- | --- | --- |
| vision avg | 76.56 | 84.69 | 80.55 | 82.03 | 74.69 |
| unseen table | 84.40 | 91.41 | 94.53 | 95.70 | 89.84 |
| dynamic texture (weak) | 83.30 | 91.02 | 82.42 | 85.55 | 78.91 |
| dynamic texture (strong) | 63.00 | 77.34 | 62.50 | 72.27 | 65.62 |
| dynamic noise (weak) | 85.40 | 89.45 | 89.84 | 87.11 | 79.69 |
| dynamic noise (strong) | 66.70 | 74.22 | 73.44 | 69.53 | 59.38 |

OOD Eval on Semantic

| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
| --- | --- | --- | --- | --- | --- |
| object avg | 75.40 | 51.61 | 56.64 | 80.57 | 74.41 |
| train setting | 93.80 | 94.14 | 91.80 | 96.09 | 84.38 |
| unseen objects | 71.40 | 80.47 | 77.73 | 81.64 | 76.56 |
| unseen receptacles | 75.00 | 74.22 | 78.12 | 81.25 | 73.44 |
| unseen instructions | 89.10 | 67.97 | 68.36 | 94.53 | 89.06 |
| multi-object (both seen) | 75.00 | 35.16 | 42.97 | 84.38 | 75.78 |
| multi-object (both unseen) | 57.80 | 30.47 | 38.67 | 62.89 | 57.81 |
| distractive receptacle | 81.20 | 18.75 | 31.64 | 82.81 | 78.12 |
| multi-receptacle (both unseen) | 59.90 | 11.72 | 23.83 | 60.94 | 60.16 |

OOD Eval on Position

| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
| --- | --- | --- | --- | --- | --- |
| position avg | 77.60 | 42.97 | 56.05 | 89.26 | 81.64 |
| unseen position (object & receptacle) | 80.70 | 40.23 | 50.39 | 86.33 | 75.00 |
| mid-episode object reposition | 74.50 | 45.70 | 61.72 | 92.19 | 88.28 |

How to Use

Please integrate the provided model with the RLinf codebase. To do so, modify the following parameters in the configuration file `examples/embodiment/config/maniskill_grpo_openvlaoft.yaml`:

  • Set `actor.checkpoint_load_path`, `actor.tokenizer.tokenizer_model`, and `rollout.model_dir` to the path of the model checkpoint.

Note: If you intend to evaluate the model directly, make sure to set `actor.model.is_lora` to `false`.
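The edits above might look like the following sketch of the relevant YAML fields (the checkpoint path is a placeholder you must replace with your local download location; other fields in the file are omitted):

```yaml
# examples/embodiment/config/maniskill_grpo_openvlaoft.yaml (excerpt, sketch)
actor:
  checkpoint_load_path: /path/to/model/checkpoint      # placeholder path
  tokenizer:
    tokenizer_model: /path/to/model/checkpoint         # same checkpoint path
  model:
    is_lora: false                                     # false for direct evaluation
rollout:
  model_dir: /path/to/model/checkpoint                 # same checkpoint path
```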

License

This code repository and the model weights are licensed under the MIT License.

Model size: 7.54B params (Safetensors, BF16)