---
license: mit
tags:
  - RLinf
language:
  - en
metrics:
  - accuracy
base_model:
  - Haozhan72/Openvla-oft-SFT-libero10-traj1
pipeline_tag: reinforcement-learning
model-index:
  - name: RLinf-OpenVLAOFT-GRPO-LIBERO-10
    results:
      - task:
          type: VLA
        dataset:
          type: libero_10
          name: libero_10
        metrics:
          - type: accuracy
            value: 94.35
---
*(RLinf logo)*

# RLinf: Reinforcement Learning Infrastructure for Agentic AI

RLinf is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development.

*(RLinf overview)*

## Model Description

The RLinf-openvlaoft-libero series is trained from the Haozhan72/Openvla-oft-SFT-libero-xxx-traj1 checkpoints (covering libero-10, libero-object, libero-goal, and libero-spatial), using the same base models and training datasets as verl. Training with RLinf yields state-of-the-art performance.

We apply a mask so that the loss is computed only over valid action tokens, and we compute a token-level loss based on the Group Relative Policy Optimization (GRPO) advantage function. This improves the model's performance on spatial reasoning, object generalization, instruction generalization, and long-horizon tasks.
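For intuition, the sketch below shows what a masked, token-level GRPO-style objective can look like. It is a minimal illustration, not RLinf's actual implementation: the grouping scheme, the per-token log-probabilities, the clipping constant, and the `action_mask` layout are all assumptions made for the example.

```python
# Minimal sketch of a masked, token-level GRPO-style loss (illustrative only;
# not RLinf's actual implementation). Assumes per-token log-probs from the
# current and old policies, one scalar reward per rollout, and a mask that is
# 1 on valid action tokens and 0 elsewhere.
import torch

def grpo_token_loss(logp_new, logp_old, rewards, action_mask,
                    group_size, clip_eps=0.2):
    """logp_new / logp_old / action_mask: [batch, seq_len]; rewards: [batch].
    Rollouts are assumed to be stored in consecutive chunks of `group_size`
    that share the same task prompt."""
    # Group-relative advantage: normalize each reward within its group.
    r = rewards.view(-1, group_size)
    adv = (r - r.mean(dim=1, keepdim=True)) / (r.std(dim=1, keepdim=True) + 1e-8)
    adv = adv.view(-1, 1)                      # broadcast over tokens

    # PPO-style clipped ratio, computed per token.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_token = -torch.min(unclipped, clipped)

    # Token-level loss, averaged over valid action tokens only.
    return (per_token * action_mask).sum() / action_mask.sum().clamp(min=1)

# Example shapes: 8 rollouts in groups of 4, with 16 action tokens each.
loss = grpo_token_loss(torch.randn(8, 16), torch.randn(8, 16),
                       torch.randn(8), torch.ones(8, 16), group_size=4)
```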

## Evaluation and Results

We trained and evaluated four models using RLinf:

### Benchmark Results

All SFT models are from SimpleVLA-RL.

- Recommended sampling settings for evaluation: LIBERO seed = 0; number of episodes = 500; `do_sample=False`.
| Model              | Object | Spatial | Goal  | Long  | Average |
|--------------------|--------|---------|-------|-------|---------|
| SFT models         | 25.60  | 56.45   | 45.59 | 9.68  | 34.33   |
| Trained with RLinf | 98.99  | 98.99   | 98.99 | 94.35 | 97.83   |
*(LIBERO benchmark results figure)*

## How to Use

Please integrate the provided model with the RLinf codebase by modifying the following parameters in the configuration file `examples/embodiment/config/libero_10_grpo_openvlaoft.yaml`:

- Set `actor.checkpoint_load_path`, `actor.tokenizer.tokenizer_model`, and `rollout.model_dir` to the path of the model checkpoint.

Note: If you intend to evaluate the model directly, make sure to set `actor.model.is_lora` to `false`.
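As a rough guide, the relevant fields might look like the excerpt below. This is a sketch only: the local path is a placeholder, and all other keys in the file are omitted.

```yaml
# Excerpt of examples/embodiment/config/libero_10_grpo_openvlaoft.yaml (sketch).
# Replace the placeholder path with wherever you downloaded the checkpoint.
actor:
  checkpoint_load_path: /path/to/RLinf-OpenVLAOFT-GRPO-LIBERO-10
  tokenizer:
    tokenizer_model: /path/to/RLinf-OpenVLAOFT-GRPO-LIBERO-10
  model:
    is_lora: false        # set to false when evaluating this checkpoint directly
rollout:
  model_dir: /path/to/RLinf-OpenVLAOFT-GRPO-LIBERO-10
```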

## License

This code repository and the model weights are licensed under the MIT License.