---
pipeline_tag: robotics
library_name: transformers
license: mit
---
# Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

This repository contains the OpenVLA-OFT checkpoint fine-tuned on all four LIBERO task suites combined (LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-Long), as described in [Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success](https://arxiv.org/abs/2502.19645). OpenVLA-OFT significantly improves upon the base OpenVLA model through an optimized fine-tuning recipe (parallel decoding with action chunking, a continuous action representation, and an L1 regression objective).

Project Page: https://openvla-oft.github.io/

Code: https://github.com/openvla-oft/openvla-oft

Other OpenVLA-OFT checkpoints: https://huggingface.co/moojink?search_models=oft
## Quick Start

This example demonstrates how to generate an action chunk with a pretrained OpenVLA-OFT checkpoint. Ensure you have set up the conda environment as described in the GitHub README. (The snippet below loads the LIBERO-Spatial checkpoint; to run this repository's model instead, point `pretrained_checkpoint` at this repo and use the matching `unnorm_key`.)

```python
import pickle

from experiments.robot.libero.run_libero_eval import GenerateConfig
from experiments.robot.openvla_utils import (
    get_action_head,
    get_processor,
    get_proprio_projector,
    get_vla,
    get_vla_action,
)
from prismatic.vla.constants import NUM_ACTIONS_CHUNK, PROPRIO_DIM

# Instantiate config (see class GenerateConfig in experiments/robot/libero/run_libero_eval.py for definitions)
cfg = GenerateConfig(
    pretrained_checkpoint="moojink/openvla-7b-oft-finetuned-libero-spatial",
    use_l1_regression=True,
    use_diffusion=False,
    use_film=False,
    num_images_in_input=2,
    use_proprio=True,
    load_in_8bit=False,
    load_in_4bit=False,
    center_crop=True,
    num_open_loop_steps=NUM_ACTIONS_CHUNK,
    unnorm_key="libero_spatial_no_noops",
)

# Load OpenVLA-OFT policy and inputs processor
vla = get_vla(cfg)
processor = get_processor(cfg)

# Load MLP action head to generate continuous actions (via L1 regression)
action_head = get_action_head(cfg, llm_dim=vla.llm_dim)

# Load proprio projector to map proprio to language embedding space
proprio_projector = get_proprio_projector(cfg, llm_dim=vla.llm_dim, proprio_dim=PROPRIO_DIM)

# Load sample observation:
#   observation (dict): {
#       "full_image": primary third-person image,
#       "wrist_image": wrist-mounted camera image,
#       "state": robot proprioceptive state,
#       "task_description": task description,
#   }
with open("experiments/robot/libero/sample_libero_spatial_observation.pkl", "rb") as file:
    observation = pickle.load(file)

# Generate robot action chunk (sequence of future actions)
actions = get_vla_action(cfg, vla, processor, observation, observation["task_description"], action_head, proprio_projector)
print("Generated action chunk:")
for act in actions:
    print(act)
```
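The call above returns a chunk of future actions that is executed open-loop before the policy is queried again. Continuing from the snippet above, the sketch below shows one way to consume these chunks in a control loop; `make_env`, `build_observation`, `MAX_STEPS`, the example task string, and the Gym-style `reset()`/`step()` interface are hypothetical placeholders for your own simulator or robot stack (e.g., a LIBERO environment wrapper), not part of this repository.

```python
# Minimal open-loop rollout sketch (not part of this repo). `make_env` and
# `build_observation` are hypothetical helpers: `make_env` stands in for your
# simulator/robot interface, and `build_observation` must pack raw sensor data
# into the dict format shown above ("full_image", "wrist_image", "state",
# "task_description").
MAX_STEPS = 220
task_description = "pick up the black bowl and place it on the plate"  # example instruction

env = make_env()       # hypothetical: construct a LIBERO (or other) environment
raw_obs = env.reset()  # assumed Gym-style reset()
done = False

for _ in range(MAX_STEPS // NUM_ACTIONS_CHUNK):
    observation = build_observation(raw_obs, task_description)

    # Query the policy once per chunk...
    actions = get_vla_action(
        cfg, vla, processor, observation, task_description, action_head, proprio_projector
    )

    # ...then execute the whole chunk open-loop before re-querying.
    for act in actions:
        raw_obs, reward, done, info = env.step(act)  # assumed Gym-style step()
        if done:
            break
    if done:
        break
```

How often the policy is re-queried is governed by the chunk size: `num_open_loop_steps=NUM_ACTIONS_CHUNK` in the config above executes the full chunk between queries.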
## Citation

```bibtex
@article{kim2025fine,
  title={Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success},
  author={Kim, Moo Jin and Finn, Chelsea and Liang, Percy},
  journal={arXiv preprint arXiv:2502.19645},
  year={2025}
}
```