---
license: apache-2.0
base_model:
- lerobot/pi0
pipeline_tag: robotics
---

# INTACT Probing Suite: Pi0 Trained from Scratch on BridgeV2

> 📦 **This model is part of the [INTACT Probing Suite Collection](https://huggingface.co/collections/ai4ce/intact-probing-suite-684e5601e9ed640fdd9b994b)**
> Explore other variants:
> - [Pi0 finetuned on BridgeV2](https://huggingface.co/juexzz/INTACT-pi0-finetune-bridge)
> - [Pi0 finetuned with paraphrase on BridgeV2](https://huggingface.co/juexzz/INTACT-pi0-finetune-rephrase-bridge)

## INTACT-pi0-scratch-bridge

This repository contains a checkpoint of the Pi0 model ([HF implementation](https://huggingface.co/lerobot/pi0) | [Paper](https://arxiv.org/abs/2410.24164v1)), *initialized from PaliGemma and trained directly ("from scratch")* on the BridgeV2 dataset for robotic manipulation tasks.
The checkpoint is then evaluated in the [Simpler Environment](https://github.com/simpler-env/SimplerEnv) and with our [INTACT](https://github.com/ai4ce/INT-ACT) probing suite, which tests the generalization boundaries of VLA models.

**Paper**: [From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models](https://arxiv.org/abs/2506.09930)

## Model Details

- **Base Model**: [lerobot/pi0](https://huggingface.co/lerobot/pi0)
- **Training Dataset**: [BridgeV2](https://rail-berkeley.github.io/bridgedata/)
- **Model Type**: Vision-Language-Action (VLA) model for robotics
- **Training Method**: See our [paper](https://arxiv.org/abs/2506.09930)
- **Training Framework**: See our [repository](https://github.com/ai4ce/INT-ACT)

## Quick Start

### Usage in INTACT

```shell
git clone --recurse-submodules https://github.com/ai4ce/INT-ACT.git
cd INT-ACT
uv sync
source .venv/bin/activate
python  # launch Python inside the project environment
```

Or use it directly in Python with LeRobot, as shown below:

### Integration with LeRobot

First, install LeRobot:

```bash
pip install lerobot
```

Then:

```python
import torch
from lerobot.common.policies.pi0.modeling_pi0 import Pi0Policy

# Load the checkpoint
policy = Pi0Policy.from_pretrained("juexzz/INTACT-pi0-scratch-bridge")

# Inference: `batch` must contain the input features this checkpoint
# was trained with (see the sketch below)
with torch.no_grad():
    actions = policy.select_action(batch)
```
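
The `batch` dictionary above must carry the same input features the checkpoint was trained with. Below is a minimal sketch of constructing one, assuming a single RGB camera, a proprioceptive state vector, and a language instruction; the key names, image resolution, and state dimension are illustrative assumptions, so check the checkpoint's policy config for the exact feature names and shapes:

```python
import torch
from lerobot.common.policies.pi0.modeling_pi0 import Pi0Policy

policy = Pi0Policy.from_pretrained("juexzz/INTACT-pi0-scratch-bridge")
policy.eval()

# Hypothetical feature layout: one camera image, one state vector, one instruction.
batch = {
    "observation.images.image0": torch.rand(1, 3, 224, 224),  # assumed camera key and resolution
    "observation.state": torch.zeros(1, 8),                   # assumed proprioception dimension
    "task": ["put the carrot on the plate"],                  # language instruction for this episode
}

with torch.no_grad():
    action = policy.select_action(batch)
print(action.shape)  # a single action from the current chunk
```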

### Training Configuration

- **Training Steps**: 15 epochs, ~22695 steps
- **Batch Size**: 1024
- **Learning Rate**: 1e-5
- **Hardware**: 4x H100/A100 GPUs
- **Input Modalities**: a single image (for compatibility with SimplerEnv), 1 language instruction, and 1 robot state
- **Output**: robot actions (delta EEF pose) with a chunk size of 4; see the rollout sketch below

For more details, please refer to our [paper](https://arxiv.org/abs/2506.09930) and [code](https://github.com/ai4ce/INT-ACT).
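
Because the policy predicts 4-step action chunks, a closed-loop rollout typically pops one action per control step and lets the policy re-plan when its internal queue empties. A rough sketch, reusing `policy` from the Quick Start (`env` and `make_batch` are hypothetical stand-ins for a SimplerEnv-style wrapper and your own observation formatting code):

```python
import torch

policy.reset()     # clear the policy's internal action queue between episodes
obs = env.reset()  # hypothetical SimplerEnv-style environment wrapper
for _ in range(200):
    batch = make_batch(obs)  # hypothetical: map raw observations to the policy's input features
    with torch.no_grad():
        action = policy.select_action(batch)  # pops one action from the predicted chunk
    obs, reward, done, info = env.step(action.squeeze(0).cpu().numpy())
    if done:
        break
```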

## Evaluation

**Checkpoint choice**
After training for 15 epochs, we sweep the checkpoints at epochs 1, 2, 3, 4, 5, 10, and 15 on the original four Bridge tasks in SimplerEnv, and choose the checkpoint with the *best average performance* for each of the three Pi0 variants.
You may therefore still get a better success rate on a specific task at another checkpoint.
For this Pi0 scratch model, the best checkpoint is at step 22695 (epoch 15).

A comparison of their performance on SimplerEnv is shown below.

### Performance Comparison on SimplerEnv

**Success rate** comparison on SimplerEnv against the other Pi0 variants and several baselines evaluated in our INTACT suite.
For a more detailed comparison, please refer to the [paper](https://arxiv.org/abs/2506.09930).

| Model | carrot_on_plate | eggplant_in_basket | stack_cube | spoon_on_towel |
|-------|-----------------|--------------------|------------|----------------|
| [Pi0 finetune](https://huggingface.co/juexzz/INTACT-pi0-finetune-bridge) | 0.361 | 0.819 | 0.264 | 0.458 |
| [Pi0 finetune rephrase](https://huggingface.co/juexzz/INTACT-pi0-finetune-rephrase-bridge) | 0.500 | 0.944 | 0.222 | 0.597 |
| **Pi0 scratch (this model)** | 0.542 | 0.903 | 0.403 | 0.875 |
| Spatial VLA | 0.125 | 0.958 | 0.292 | 0.208 |
| Magma | 0.250 | 0.611 | 0.097 | 0.208 |
| Octo Small | 0.014 | 0.097 | 0.000 | 0.097 |
| Octo Base | 0.014 | 0.306 | 0.000 | 0.014 |
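
Since checkpoints were selected by average success rate over these four tasks, the per-variant averages can be recomputed directly from the table:

```python
# Average success rate across the four Bridge tasks, using the numbers above.
results = {
    "Pi0 finetune":          [0.361, 0.819, 0.264, 0.458],
    "Pi0 finetune rephrase": [0.500, 0.944, 0.222, 0.597],
    "Pi0 scratch":           [0.542, 0.903, 0.403, 0.875],
}
for name, rates in results.items():
    print(f"{name}: {sum(rates) / len(rates):.3f}")
# Pi0 scratch averages ~0.681, the highest of the three variants.
```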

## Citation

If you use this model in your research, please cite:

```bibtex
@article{fang2025intention,
  title={From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models},
  author={Fang, Irving and Zhang, Juexiao and Tong, Shengbang and Feng, Chen},
  journal={arXiv preprint arXiv:2506.09930},
  year={2025}
}
```

## Related Work

- **Pi0 (official)**: [pi0 (JAX)](https://github.com/Physical-Intelligence/openpi)
- **Base Model (Pi0 HF)**: [lerobot/pi0](https://huggingface.co/lerobot/pi0)
- **Dataset**: [BridgeV2](https://bridge-v2.github.io/)
- **Framework**: [LeRobot](https://github.com/huggingface/lerobot)
- **Simpler Environment**: [SimplerEnv](https://github.com/simpler-env/SimplerEnv)
- **Open-source Pi0 Implementation by Allen Ren**: [open-pi-zero](https://github.com/allenzren/open-pi-zero)

## License

This model is released under the Apache 2.0 license. Please see the base model's license for any additional restrictions.

## Support

For questions about this model:
- 📧 Open an issue in this repository
- 💬 Use the Discussions tab for community questions
- 📖 Check our [paper](https://arxiv.org/abs/2506.09930) for technical details

---

*Last updated: June 2025*