---
license: apache-2.0
base_model:
- lerobot/pi0
pipeline_tag: robotics
---
# INTACT Probing Suite: Pi0 from scratch on BridgeV2
> 📦 **This model is part of the [INTACT Probing Suite Collection](https://huggingface.co/collections/ai4ce/intact-probing-suite-684e5601e9ed640fdd9b994b)**
> Explore other variants:
> - [Pi0 finetuned on BridgeV2](https://huggingface.co/juexzz/INTACT-pi0-finetune-bridge)
> - [Pi0 finetuned with paraphrase on BridgeV2](https://huggingface.co/juexzz/INTACT-pi0-finetune-rephrase-bridge)
## INTACT-pi0-scratch-bridge
This repository contains a checkpoint of the Pi0 model ([HF implementation](https://huggingface.co/lerobot/pi0) | [Paper](https://arxiv.org/abs/2410.24164v1)) *initialized from PaliGemma and trained directly ("from scratch")* on the BridgeV2 dataset for robotic manipulation tasks.
This checkpoint is later evaluated in the [Simpler Environment](https://github.com/simpler-env/SimplerEnv) and with our [INTACT](https://github.com/ai4ce/INT-ACT) Probing Suite to probe the generalization boundaries of VLA models.
**Paper**: [From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models](https://arxiv.org/abs/2506.09930)
## Model Details
- **Base Model**: [lerobot/pi0](https://huggingface.co/lerobot/pi0)
- **Training Dataset**: [BridgeV2](https://rail-berkeley.github.io/bridgedata/)
- **Model Type**: Vision-Language-Action (VLA) model for robotics
- **Fine-tuning Method**: See our [paper](https://arxiv.org/abs/2506.09930)
- **Training Framework**: See our [repository](https://github.com/ai4ce/INT-ACT)
## Quick Start
### Usage in INTACT
```shell
git clone --recurse-submodules https://github.com/ai4ce/INT-ACT.git
cd INT-ACT
uv sync
source .venv/bin/activate
python
```
Or use it directly in Python with LeRobot, as shown below:
### Integration with LeRobot
First, install LeRobot:
```bash
pip install lerobot
```
Then
```python
import torch
from lerobot.common.policies.pi0.modeling_pi0 import PI0Policy

# Load the checkpoint
policy = PI0Policy.from_pretrained("juexzz/INTACT-pi0-scratch-bridge")

# Inference: `batch` is a dict holding the camera image, robot state, and
# language instruction, formatted according to the policy's input features
with torch.no_grad():
    actions = policy.select_action(batch)
```
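The `batch` passed to `select_action` must match the policy's expected input features. The sketch below is illustrative only; the key names and shapes are assumptions based on common LeRobot conventions, so check the loaded checkpoint's config (e.g. `policy.config.input_features`) for the exact keys it expects.

```python
# Illustrative batch layout for select_action(), with placeholder shape
# tuples standing in for real tensors. All key names and shapes here are
# ASSUMPTIONS; verify against the policy config of your checkpoint.
batch_spec = {
    "observation.images.image": (1, 3, 224, 224),  # single camera view (B, C, H, W)
    "observation.state": (1, 8),                   # proprioceptive robot state
    "task": "put the carrot on the plate",         # language instruction
}
print(sorted(batch_spec))
```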
### Training Configuration
- **Training Steps**: 15 epochs (~22,695 steps)
- **Batch Size**: 1024
- **Learning Rate**: 1e-5
- **Hardware**: 4× H100/A100 GPUs
- **Input Modalities**: a single image (for compatibility with SimplerEnv), one language instruction, and one robot state
- **Output**: robot actions (delta EEF) with a chunk size of 4
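Because the policy emits actions in chunks of 4, a control loop typically drains the current chunk before querying the policy again. A minimal sketch of that pattern, using a hypothetical stand-in policy (the real call is `policy.select_action(batch)`):

```python
# Sketch: consuming chunked actions (chunk size 4).
# mock_policy is a hypothetical stand-in for the real model, which
# returns delta end-effector actions from policy.select_action(batch).
from collections import deque

CHUNK_SIZE = 4  # matches this model's action chunk size

def mock_policy(observation):
    """Stand-in: returns a chunk of 4 placeholder 7-DoF delta-EEF actions."""
    return [[0.01 * (i + 1)] * 7 for i in range(CHUNK_SIZE)]

action_queue = deque()
executed = []
for step in range(10):
    if not action_queue:  # re-query the policy once the chunk is used up
        action_queue.extend(mock_policy(observation=None))
    executed.append(action_queue.popleft())

print(len(executed))  # 10 actions executed, from 3 policy queries
```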
For more details, please refer to our [paper](https://arxiv.org/abs/2506.09930) and [code](https://github.com/ai4ce/INT-ACT).
## Evaluation
**Checkpoint choice**
After training for 15 epochs, we sweep the checkpoints at epochs 1, 2, 3, 4, 5, 10, and 15 for performance on the original 4 Bridge tasks in SimplerEnv, and choose the checkpoint with the *best average performance* for each of the three Pi0 variants.
Therefore, you may still get a better success rate on a specific task at other checkpoints.
As a result, the best checkpoint for this Pi0 scratch model is at step 22695 (epoch 15).
The comparison of their performance on SimplerEnv is shown below.
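The released step number is consistent with roughly 1,513 optimizer steps per epoch. A quick sanity check of the epoch-to-step mapping, assuming uniform steps per epoch (the actual run's per-epoch counts may differ slightly due to dataloader rounding):

```python
# Map the swept epochs to approximate optimizer steps, assuming a uniform
# ~22,695 steps spread over 15 epochs (an approximation of the real run).
TOTAL_STEPS, TOTAL_EPOCHS = 22695, 15
steps_per_epoch = TOTAL_STEPS // TOTAL_EPOCHS  # 1513

swept_epochs = [1, 2, 3, 4, 5, 10, 15]
checkpoints = {e: e * steps_per_epoch for e in swept_epochs}
print(checkpoints[15])  # 22695, the checkpoint released here
```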
### Performance Comparison on SimplerEnv
**Success rate** comparison on SimplerEnv against the other Pi0 variants and additional baselines evaluated in our INTACT suite.
For a more detailed comparison, please refer to the [paper](https://arxiv.org/abs/2506.09930).
| Model | carrot_on_plate | eggplant_in_basket | stack_cube | spoon_on_towel |
|-------|-----------------|-------------------|------------|----------------|
| [Pi0 finetune](https://huggingface.co/juexzz/INTACT-pi0-finetune-bridge) | 0.361 | 0.819 | 0.264 | 0.458 |
| [Pi0 finetune rephrase](https://huggingface.co/juexzz/INTACT-pi0-finetune-rephrase-bridge) | 0.500 | 0.944 | 0.222 | 0.597 |
| **Pi0 scratch (this model)** | 0.542 | 0.903 | 0.403 | 0.875 |
| Spatial VLA | 0.125 | 0.958 | 0.292 | 0.208 |
| Magma | 0.250 | 0.611 | 0.097 | 0.208 |
| Octo Small | 0.014 | 0.097 | 0.000 | 0.097 |
| Octo Base | 0.014 | 0.306 | 0.000 | 0.014 |
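The *best average performance* criterion described above can be checked directly from the table. A minimal verification over the four task success rates of the three Pi0 variants (numbers copied from the table):

```python
# Average success rate across the four Bridge tasks, taken from the table:
# carrot_on_plate, eggplant_in_basket, stack_cube, spoon_on_towel.
results = {
    "pi0_finetune":          [0.361, 0.819, 0.264, 0.458],
    "pi0_finetune_rephrase": [0.500, 0.944, 0.222, 0.597],
    "pi0_scratch":           [0.542, 0.903, 0.403, 0.875],
}
averages = {m: sum(r) / len(r) for m, r in results.items()}
best = max(averages, key=averages.get)
print(best, round(averages[best], 3))  # pi0_scratch 0.681
```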
## Citation
If you use this model in your research, please cite:
```bibtex
@article{fang2025intention,
title={From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models},
author={Fang, Irving and Zhang, Juexiao and Tong, Shengbang and Feng, Chen},
journal={arXiv preprint arXiv:2506.09930},
year={2025}
}
```
## Related Work
- **Pi0 (official)**: [pi0 (JAX)](https://github.com/Physical-Intelligence/openpi)
- **Base Model (Pi0 HF)**: [lerobot/pi0](https://huggingface.co/lerobot/pi0)
- **Dataset**: [BridgeV2](https://bridge-v2.github.io/)
- **Framework**: [LeRobot](https://github.com/huggingface/lerobot)
- **Simpler Environment**: [SimplerEnv](https://github.com/simpler-env/SimplerEnv)
- **Open-source Pi0 Implementation by Allen Ren**: [open-pi-zero](https://github.com/allenzren/open-pi-zero)
## License
This model is released under the Apache 2.0 license. Please see the base model's license for any additional restrictions.
## Support
For questions about this model:
- 📧 Open an issue in this repository
- 💬 Use the Discussion tab for community questions
- 📄 Check our [paper](https://arxiv.org/abs/2506.09930) for technical details
---
*Last updated: June 2025*