---
pipeline_tag: robotics
library_name: transformers
license: mit
---
# Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

This repository contains the OpenVLA-OFT checkpoint fine-tuned on all four LIBERO task suites combined (LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-Long), as described in [Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success](https://arxiv.org/abs/2502.19645). OpenVLA-OFT significantly improves upon the base OpenVLA model through an optimized fine-tuning recipe (parallel decoding with action chunking, a continuous action representation, and an L1 regression objective).

Project Page: https://openvla-oft.github.io/

Code: https://github.com/openvla-oft/openvla-oft

Other OpenVLA-OFT checkpoints: https://huggingface.co/moojink?search_models=oft
## Quick Start

This example demonstrates how to generate an action chunk with a pretrained OpenVLA-OFT checkpoint. Ensure you have set up the conda environment as described in the GitHub README. (The snippet below loads the LIBERO-Spatial checkpoint; to run this repository's model instead, point `pretrained_checkpoint` at this repo and use the matching `unnorm_key`.)

```python
import pickle

from experiments.robot.libero.run_libero_eval import GenerateConfig
from experiments.robot.openvla_utils import (
    get_action_head,
    get_processor,
    get_proprio_projector,
    get_vla,
    get_vla_action,
)
from prismatic.vla.constants import NUM_ACTIONS_CHUNK, PROPRIO_DIM

# Instantiate config (see class GenerateConfig in experiments/robot/libero/run_libero_eval.py for definitions)
cfg = GenerateConfig(
    pretrained_checkpoint="moojink/openvla-7b-oft-finetuned-libero-spatial",
    use_l1_regression=True,
    use_diffusion=False,
    use_film=False,
    num_images_in_input=2,
    use_proprio=True,
    load_in_8bit=False,
    load_in_4bit=False,
    center_crop=True,
    num_open_loop_steps=NUM_ACTIONS_CHUNK,
    unnorm_key="libero_spatial_no_noops",
)

# Load OpenVLA-OFT policy and inputs processor
vla = get_vla(cfg)
processor = get_processor(cfg)

# Load MLP action head to generate continuous actions (via L1 regression)
action_head = get_action_head(cfg, llm_dim=vla.llm_dim)

# Load proprio projector to map proprio to language embedding space
proprio_projector = get_proprio_projector(cfg, llm_dim=vla.llm_dim, proprio_dim=PROPRIO_DIM)

# Load sample observation:
#   observation (dict): {
#       "full_image": primary third-person image,
#       "wrist_image": wrist-mounted camera image,
#       "state": robot proprioceptive state,
#       "task_description": task description,
#   }
with open("experiments/robot/libero/sample_libero_spatial_observation.pkl", "rb") as file:
    observation = pickle.load(file)

# Generate robot action chunk (sequence of future actions)
actions = get_vla_action(cfg, vla, processor, observation, observation["task_description"], action_head, proprio_projector)
print("Generated action chunk:")
for act in actions:
    print(act)
```
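The call above returns a chunk of future actions that is executed open-loop before the policy is queried again. Continuing from the snippet above, the sketch below shows one way to consume these chunks in a control loop; `make_env`, `build_observation`, `MAX_STEPS`, the example task string, and the Gym-style `reset()`/`step()` interface are hypothetical placeholders for your own simulator or robot stack (e.g., a LIBERO environment wrapper), not part of this repository.

```python
# Minimal open-loop rollout sketch (not part of this repo). `make_env` and
# `build_observation` are hypothetical helpers: `make_env` stands in for your
# simulator/robot interface, and `build_observation` must pack raw sensor data
# into the dict format shown above ("full_image", "wrist_image", "state",
# "task_description").
MAX_STEPS = 220
task_description = "pick up the black bowl and place it on the plate"  # example instruction

env = make_env()       # hypothetical: construct a LIBERO (or other) environment
raw_obs = env.reset()  # assumed Gym-style reset()
done = False

for _ in range(MAX_STEPS // NUM_ACTIONS_CHUNK):
    observation = build_observation(raw_obs, task_description)

    # Query the policy once per chunk...
    actions = get_vla_action(
        cfg, vla, processor, observation, task_description, action_head, proprio_projector
    )

    # ...then execute the whole chunk open-loop before re-querying.
    for act in actions:
        raw_obs, reward, done, info = env.step(act)  # assumed Gym-style step()
        if done:
            break
    if done:
        break
```

How often the policy is re-queried is governed by the chunk size: `num_open_loop_steps=NUM_ACTIONS_CHUNK` in the config above executes the full chunk between queries.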
## Citation

```bibtex
@article{kim2025fine,
  title={Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success},
  author={Kim, Moo Jin and Finn, Chelsea and Liang, Percy},
  journal={arXiv preprint arXiv:2502.19645},
  year={2025}
}
```