# RLinf: Reinforcement Learning Infrastructure for Agentic AI
RLinf is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development.

## Model Description

This openvla-oft model starts from Haozhan72/Openvla-oft-SFT-libero10-trajall with an additional LoRA SFT checkpoint, and is then fine-tuned with Group Relative Policy Optimization (GRPO) on the ManiSkill simulator.
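For context, GRPO forgoes a learned value critic and instead computes group-relative advantages: for each task instance, a group of $G$ rollouts is sampled and each rollout's reward is normalized against the group statistics. This is the standard GRPO formulation; the exact variant used here is determined by the RLinf training configuration.

$$
A_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}
$$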
## Full OOD Evaluation and Results

### Overall OOD Eval Results

Note: rl4vla refers to the paper *VLA-RL-Study: What Can RL Bring to VLA Generalization? An Empirical Study*.

Setting | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
---|---|---|---|---|---|
Avg results | 76.08 | 61.48 | 64.53 | 82.21 | 75.47 |
### OOD Eval on Vision

Setting | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
---|---|---|---|---|---|
vision avg | 76.56 | 84.69 | 80.55 | 82.03 | 74.69 |
unseen table | 84.40 | 91.41 | 94.53 | 95.70 | 89.84 |
dynamic texture (weak) | 83.30 | 91.02 | 82.42 | 85.55 | 78.91 |
dynamic texture (strong) | 63.00 | 77.34 | 62.50 | 72.27 | 65.62 |
dynamic noise (weak) | 85.40 | 89.45 | 89.84 | 87.11 | 79.69 |
dynamic noise (strong) | 66.70 | 74.22 | 73.44 | 69.53 | 59.38 |
### OOD Eval on Semantic

Setting | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
---|---|---|---|---|---|
object avg | 75.40 | 51.61 | 56.64 | 80.57 | 74.41 |
train setting | 93.80 | 94.14 | 91.80 | 96.09 | 84.38 |
unseen objects | 71.40 | 80.47 | 77.73 | 81.64 | 76.56 |
unseen receptacles | 75.00 | 74.22 | 78.12 | 81.25 | 73.44 |
unseen instructions | 89.10 | 67.97 | 68.36 | 94.53 | 89.06 |
multi-object (both seen) | 75.00 | 35.16 | 42.97 | 84.38 | 75.78 |
multi-object (both unseen) | 57.80 | 30.47 | 38.67 | 62.89 | 57.81 |
distractive receptacle | 81.20 | 18.75 | 31.64 | 82.81 | 78.12 |
multi-receptacle (both unseen) | 59.90 | 11.72 | 23.83 | 60.94 | 60.16 |
### OOD Eval on Position

Setting | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
---|---|---|---|---|---|
position avg | 77.60 | 42.97 | 56.05 | 89.26 | 81.64 |
unseen position (object & receptacle) | 80.70 | 40.23 | 50.39 | 86.33 | 75.00 |
mid-episode object reposition | 74.50 | 45.70 | 61.72 | 92.19 | 88.28 |
## How to Use

Please integrate the provided model with the RLinf codebase. To do so, modify the following parameters in the configuration file `examples/embodiment/config/maniskill_grpo_openvlaoft.yaml` (a sketch of the edited fields follows below):
- Set `actor.checkpoint_load_path`, `actor.tokenizer.tokenizer_model`, and `rollout.model_dir` to the path of the model checkpoint.

Note: If you intend to evaluate the model directly, make sure to set `actor.model.is_lora` to `false`.
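Below is a minimal sketch of how these fields might look after editing. The YAML nesting is inferred from the dotted parameter names above, and the checkpoint path is a placeholder; consult the actual file for its full structure.

```yaml
# Sketch only: nesting inferred from the dotted parameter names above; the path
# is a placeholder for your local copy of this model checkpoint.
actor:
  checkpoint_load_path: /path/to/model/checkpoint
  tokenizer:
    tokenizer_model: /path/to/model/checkpoint
  model:
    is_lora: false   # set to false when evaluating the model directly
rollout:
  model_dir: /path/to/model/checkpoint
```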
## License

This code repository and the model weights are licensed under the MIT License.
## Evaluation Results

Self-reported accuracy:
- maniskill-vision: 84.600
- maniskill-semantic: 51.600
- maniskill-position: 42.900