---
pipeline_tag: robotics
library_name: transformers
license: mit
---

# Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

This repository contains the OpenVLA-OFT checkpoint trained on all four LIBERO task suites combined (LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-Long), as described in [Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success](https://arxiv.org/abs/2502.19645). OpenVLA-OFT significantly improves upon the base OpenVLA model by incorporating optimized fine-tuning techniques.

Project Page: https://openvla-oft.github.io/

Code: https://github.com/openvla-oft/openvla-oft

See here for other OpenVLA-OFT checkpoints: https://huggingface.co/moojink?search_models=oft

## Quick Start

This example demonstrates how to generate an action chunk with a pretrained OpenVLA-OFT checkpoint. Make sure you have set up the conda environment as described in the GitHub README.

```python
import pickle

from experiments.robot.libero.run_libero_eval import GenerateConfig
from experiments.robot.openvla_utils import (
    get_action_head,
    get_processor,
    get_proprio_projector,
    get_vla,
    get_vla_action,
)
from prismatic.vla.constants import NUM_ACTIONS_CHUNK, PROPRIO_DIM

# Instantiate config (see class GenerateConfig in experiments/robot/libero/run_libero_eval.py for definitions)
cfg = GenerateConfig(
    pretrained_checkpoint="moojink/openvla-7b-oft-finetuned-libero-spatial",
    use_l1_regression=True,
    use_diffusion=False,
    use_film=False,
    num_images_in_input=2,
    use_proprio=True,
    load_in_8bit=False,
    load_in_4bit=False,
    center_crop=True,
    num_open_loop_steps=NUM_ACTIONS_CHUNK,
    unnorm_key="libero_spatial_no_noops",
)

# Load OpenVLA-OFT policy and inputs processor
vla = get_vla(cfg)
processor = get_processor(cfg)

# Load MLP action head to generate continuous actions (via L1 regression)
action_head = get_action_head(cfg, llm_dim=vla.llm_dim)

# Load proprio projector to map proprio to language embedding space
proprio_projector = get_proprio_projector(cfg, llm_dim=vla.llm_dim, proprio_dim=PROPRIO_DIM)

# Load sample observation:
#   observation (dict): {
#     "full_image": primary third-person image,
#     "wrist_image": wrist-mounted camera image,
#     "state": robot proprioceptive state,
#     "task_description": task description,
#   }
with open("experiments/robot/libero/sample_libero_spatial_observation.pkl", "rb") as file:
    observation = pickle.load(file)

# Generate robot action chunk (sequence of future actions)
actions = get_vla_action(cfg, vla, processor, observation, observation["task_description"], action_head, proprio_projector)
print("Generated action chunk:")
for act in actions:
    print(act)
```

## Citation

```bibtex
@article{kim2025fine,
  title={Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success},
  author={Kim, Moo Jin and Finn, Chelsea and Liang, Percy},
  journal={arXiv preprint arXiv:2502.19645},
  year={2025}
}
```
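
## Open-Loop Chunk Execution (Illustrative Sketch)

The Quick Start config sets `num_open_loop_steps = NUM_ACTIONS_CHUNK`, i.e. the policy is queried once and the entire returned chunk of future actions is executed before the next query. The sketch below illustrates that control pattern; it is not part of this repository's API. It reuses `cfg`, `vla`, `processor`, `action_head`, and `proprio_projector` from the Quick Start, and `env` and `run_open_loop_episode` are hypothetical placeholders for a simulator or robot interface whose `reset()`/`step()` return observation dicts with the keys shown above.

```python
# Minimal sketch of open-loop chunk execution, assuming the objects loaded in
# the Quick Start above are in scope. `env` is a hypothetical placeholder for a
# simulator or robot interface (NOT part of this repo) whose reset()/step()
# return observation dicts with keys:
#   "full_image", "wrist_image", "state", "task_description"

def run_open_loop_episode(env, max_steps=512):
    obs = env.reset()
    steps_taken = 0
    while steps_taken < max_steps:
        # One forward pass yields NUM_ACTIONS_CHUNK future actions.
        action_chunk = get_vla_action(
            cfg, vla, processor, obs, obs["task_description"],
            action_head, proprio_projector,
        )
        # Execute the whole chunk open-loop before re-querying the policy.
        for action in action_chunk:
            obs, reward, done, info = env.step(action)
            steps_taken += 1
            if done or steps_taken >= max_steps:
                return done
    return False
```

Re-querying only after the chunk is exhausted is what makes inference fast relative to per-step querying; shortening `num_open_loop_steps` trades some of that speed for more frequent replanning.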