---
license: apache-2.0
base_model:
  - lerobot/pi0
pipeline_tag: robotics
---

# INTACT Probing Suite: Pi0 from scratch on BridgeV2

📦 This model is part of the INTACT Probing Suite Collection. Explore the other variants there.

## INTACT-pi0-scratch-bridge

This repository contains a checkpoint of the Pi0 model (HF implementation | Paper) initialized from PaliGemma and trained directly ("from scratch") on the BridgeV2 dataset for robotic manipulation tasks. The checkpoint is then evaluated in SimplerEnv and with our INTACT Probing Suite to test the generalization boundaries of VLA models.

**Paper:** [From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models](https://arxiv.org/abs/2506.09930)

## Model Details

- **Base Model:** [lerobot/pi0](https://huggingface.co/lerobot/pi0)
- **Training Dataset:** BridgeV2
- **Model Type:** Vision-Language-Action (VLA) model for robotics
- **Training Method:** trained from scratch on BridgeV2; see our paper
- **Training Framework:** see our [INT-ACT repository](https://github.com/ai4ce/INT-ACT)

## Quick Start

### Usage in INTACT

```bash
git clone --recurse-submodules https://github.com/ai4ce/INT-ACT.git
cd INT-ACT
uv sync
source .venv/bin/activate
python
```

Or use it directly in Python with LeRobot, as shown below.

### Integration with LeRobot

First, install LeRobot:

```bash
pip install lerobot
```

Then:

```python
import torch
from lerobot.common.policies.pi0.modeling_pi0 import PI0Policy

# Load the pretrained policy
policy = PI0Policy.from_pretrained("juexzz/INTACT-pi0-scratch-bridge")

# Inference: `batch` must match the checkpoint's expected input features
# (a single image, a robot state, and a language instruction; see below)
with torch.no_grad():
    actions = policy.select_action(batch)
```
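The snippet above assumes you already have a `batch`. Below is a minimal, untested sketch of what such a batch can look like; the key names, image resolution, and state dimension are assumptions, so inspect `policy.config.input_features` on the loaded checkpoint for the authoritative spec:

```python
import torch
from lerobot.common.policies.pi0.modeling_pi0 import PI0Policy

policy = PI0Policy.from_pretrained("juexzz/INTACT-pi0-scratch-bridge")
policy.eval()

# Illustrative batch: key names, image size, and state dimension are
# assumptions; check policy.config.input_features for the real ones.
batch = {
    "observation.images.image_0": torch.rand(1, 3, 224, 224),  # single RGB camera, values in [0, 1]
    "observation.state": torch.rand(1, 8),                     # proprioceptive robot state
    "task": ["put the carrot on the plate"],                   # language instruction
}

with torch.no_grad():
    action = policy.select_action(batch)  # one action step from the current chunk
print(action.shape)
```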

## Training Configuration

- **Training Steps:** 15 epochs (~22,695 steps)
- **Batch Size:** 1024
- **Learning Rate:** 1e-5
- **Hardware:** 4× H100/A100 GPUs
- **Input Modalities:** a single image (to work with SimplerEnv), one language instruction, and one robot state
- **Output:** robot actions (delta EEF) with a chunk size of 4 (see the rollout sketch after this list); for more details, please refer to our paper and code
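Because the policy outputs 4-step action chunks, a control loop simply calls `select_action` once per step: in the LeRobot implementation the network runs once per chunk and subsequent calls pop the remaining actions from an internal queue. A minimal rollout sketch, assuming a gym-style SimplerEnv wrapper `env` and a hypothetical `make_batch` helper that packs an observation into the input format shown above:

```python
# `env` and `make_batch` are hypothetical stand-ins for a SimplerEnv
# wrapper and an observation-to-batch converter, respectively.
policy.reset()  # clear the internal action queue at the start of an episode

obs = env.reset()
done = False
while not done:
    batch = make_batch(obs)  # image + state + instruction -> policy input dict
    with torch.no_grad():
        action = policy.select_action(batch)  # network runs once per 4-step chunk
    obs, reward, done, info = env.step(action.squeeze(0).cpu().numpy())
```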

## Evaluation

**Checkpoint choice.** After training for 15 epochs, we sweep the checkpoints at epochs 1, 2, 3, 4, 5, 10, and 15 on the four original Bridge tasks in SimplerEnv and, for each of the three Pi0 variants, choose the checkpoint with the best average performance. You may therefore still get a better success rate on a specific task at another checkpoint. For this Pi0 scratch model, the best checkpoint is at step 22695 (epoch 15).
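The selection rule is simply the highest mean success rate over the four tasks. A toy illustration (the epoch-5 numbers are invented; the epoch-15 row uses this model's values from the table below):

```python
# Hypothetical sweep results: per-task success rates for two checkpoints.
sweep = {
    "epoch_5":  {"carrot_on_plate": 0.40, "eggplant_in_basket": 0.85, "stack_cube": 0.30, "spoon_on_towel": 0.70},
    "epoch_15": {"carrot_on_plate": 0.542, "eggplant_in_basket": 0.903, "stack_cube": 0.403, "spoon_on_towel": 0.875},
}
best = max(sweep, key=lambda c: sum(sweep[c].values()) / len(sweep[c]))
print(best)  # -> epoch_15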

A comparison of their performance on SimplerEnv is shown below.

### Performance Comparison on SimplerEnv

Success rates on SimplerEnv for this model, the other Pi0 variants, and additional baselines evaluated in our INTACT suite. For a more detailed comparison, please refer to the paper.

| Model | carrot_on_plate | eggplant_in_basket | stack_cube | spoon_on_towel |
|---|---|---|---|---|
| Pi0 finetune | 0.361 | 0.819 | 0.264 | 0.458 |
| Pi0 finetune rephrase | 0.500 | 0.944 | 0.222 | 0.597 |
| **Pi0 scratch (this model)** | 0.542 | 0.903 | 0.403 | 0.875 |
| Spatial VLA | 0.125 | 0.958 | 0.292 | 0.208 |
| Magma | 0.250 | 0.611 | 0.097 | 0.208 |
| Octo Small | 0.014 | 0.097 | 0.000 | 0.097 |
| Octo Base | 0.014 | 0.306 | 0.000 | 0.014 |

## Citation

If you use this model in your research, please cite:

```bibtex
@article{fang2025intention,
  title={From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models},
  author={Fang, Irving and Zhang, Juexiao and Tong, Shengbang and Feng, Chen},
  journal={arXiv preprint arXiv:2506.09930},
  year={2025}
}
```


## License

This model is released under the Apache 2.0 license. Please see the base model's license for any additional restrictions.

## Support

For questions about this model:

- 📧 Open an issue in this repository
- 💬 Use the Discussions tab for community questions
- 📖 Check our paper for technical details

Last updated: June 2025