---
license: apache-2.0
base_model:
- lerobot/pi0
pipeline_tag: robotics
---
# INTACT Probing Suite: Pi0 from scratch on BridgeV2

This model is part of the INTACT Probing Suite Collection. Explore other variants: INTACT-pi0-scratch-bridge
This repository contains a checkpoint of the Pi0 model (HF implementation | Paper), initialized from PaliGemma and trained directly ("from scratch") on the BridgeV2 dataset for robotic manipulation tasks. The checkpoint is then evaluated in SimplerEnv and with our INTACT Probing Suite, which tests the generalization boundaries of VLA models.
Paper: From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models
## Model Details
- Base Model: lerobot/pi0
- Training Dataset: BridgeV2
- Model Type: Vision-Language-Action (VLA) model for robotics
- Fine-tuning Method: See our paper
- Training Framework: See our repository
## Quick Start

### Usage in INTACT
```bash
git clone --recurse-submodules https://github.com/ai4ce/INT-ACT.git
cd INT-ACT
uv sync
source .venv/bin/activate
python
```
Alternatively, use the model directly in Python with LeRobot, as shown below.
### Integration with LeRobot

First, install lerobot:

```bash
pip install lerobot
```

Then:
```python
import torch
from lerobot.common.policies.pi0.modeling_pi0 import PI0Policy

# Load the pretrained policy
policy = PI0Policy.from_pretrained("juexzz/INTACT-pi0-scratch-bridge")

# Inference (`batch` is a dict of observations; see the sketch below)
with torch.no_grad():
    actions = policy.select_action(batch)
```
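`select_action` expects a batch of observations. Below is a minimal sketch of what that dict might look like; the exact feature keys (camera name, image resolution, state dimensionality) depend on the training configuration, so the `observation.images.image` key, the 8-dim state, and the `task` field are assumptions to verify against this checkpoint's config.

```python
import torch

# Hypothetical inference batch; key names and shapes are assumptions,
# not guaranteed by this checkpoint. Verify them against the policy config.
batch = {
    # single RGB frame, shape (batch, channels, height, width), float32 in [0, 1]
    "observation.images.image": torch.rand(1, 3, 224, 224),
    # proprioceptive robot state, shape (batch, state_dim); 8 is a placeholder
    "observation.state": torch.rand(1, 8),
    # one natural-language instruction per batch element
    "task": ["put the carrot on the plate"],
}

with torch.no_grad():
    action = policy.select_action(batch)
print(action.shape)  # one action from the predicted chunk
```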
## Training Configuration
- Training Steps: 15 epochs (~22695 steps)
- Batch Size: 1024
- Learning Rate: 1e-5
- Hardware: 4× H100/A100 GPUs
- Input Modalities: a single camera image (for compatibility with SimplerEnv), one language instruction, and one robot state.
- Output: delta end-effector (EEF) actions with a chunk size of 4 (see the sketch after this list). For more details, please refer to our paper and code.
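To make the chunking concrete, here is a minimal sketch of how the 4-action chunk is consumed at inference time. It assumes LeRobot's usual queueing behavior, where `select_action` buffers the predicted chunk internally and returns one action per call; the per-action contents (e.g., a 7-D delta pose plus gripper) are an assumption, not confirmed by this card.

```python
# Sketch of chunked-action consumption; assumes LeRobot's internal action queue.
policy.reset()  # clear any buffered actions, e.g. at the start of an episode
for t in range(8):  # two chunks' worth of control steps
    with torch.no_grad():
        # A new 4-step chunk is predicted only when the internal queue is
        # empty; otherwise the next buffered action is popped.
        action = policy.select_action(batch)  # one delta-EEF action
    # apply `action` to the robot or simulator here
```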
## Evaluation
**Checkpoint choice.** After training for 15 epochs, we sweep checkpoints at epochs 1, 2, 3, 4, 5, 10, and 15 on the original four Bridge tasks in SimplerEnv, and for each of the three Pi0 variants we choose the checkpoint with the best average performance. You may therefore still get a better success rate on a specific task at another checkpoint. For this Pi0 scratch model, the best checkpoint is at step 22695 (epoch 15).
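As a toy illustration of this selection rule: all per-epoch numbers below except the epoch-15 row (which matches the table that follows) are made-up placeholders, not the actual sweep results.

```python
# Illustrative checkpoint selection: pick the epoch with the best average
# success rate over the four Bridge tasks. Values for epochs 1 and 5 are
# placeholders; only the epoch-15 row is from the reported results.
sweep = {
    1:  [0.10, 0.40, 0.05, 0.20],
    5:  [0.30, 0.70, 0.20, 0.50],
    15: [0.542, 0.903, 0.403, 0.875],
}
best_epoch = max(sweep, key=lambda e: sum(sweep[e]) / len(sweep[e]))
print(best_epoch)  # -> 15
```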
Their performance on SimplerEnv is compared below.
### Performance Comparison on SimplerEnv
Success rates in SimplerEnv for this model, the other Pi0 variants, and additional baselines evaluated in our INTACT suite. For a more detailed comparison, please refer to the paper.
| Model | carrot_on_plate | eggplant_in_basket | stack_cube | spoon_on_towel |
|---|---|---|---|---|
| Pi0 finetune | 0.361 | 0.819 | 0.264 | 0.458 |
| Pi0 finetune rephrase | 0.500 | 0.944 | 0.222 | 0.597 |
| Pi0 scratch (this model) | 0.542 | 0.903 | 0.403 | 0.875 |
| Spatial VLA | 0.125 | 0.958 | 0.292 | 0.208 |
| Magma | 0.250 | 0.611 | 0.097 | 0.208 |
| Octo Small | 0.014 | 0.097 | 0.000 | 0.097 |
| Octo Base | 0.014 | 0.306 | 0.000 | 0.014 |
## Citation
If you use this model in your research, please cite:
```bibtex
@article{fang2025intention,
  title={From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models},
  author={Fang, Irving and Zhang, Juexiao and Tong, Shengbang and Feng, Chen},
  journal={arXiv preprint arXiv:2506.09930},
  year={2025}
}
```
## Related Work
- Pi0 (official): pi0 (JAX)
- Base Model (Pi0 HF): lerobot/pi0
- Dataset: BridgeV2
- Framework: LeRobot
- Simpler Environment: SimplerEnv
- Open-source Pi0 Implementation by Allen Ren: open-pi-zero
## License
This model is released under the Apache 2.0 license. Please see the base model's license for any additional restrictions.
## Support
For questions about this model:
- Open an issue in this repository
- Use the Discussions tab for community questions
- Check our paper for technical details
Last updated: June 2025