---
license: apache-2.0
base_model:
- lerobot/pi0
pipeline_tag: robotics
---

# INTACT Probing Suite: Pi0 from scratch on BridgeV2

> ๐Ÿ“ฆ **This model is part of the [INTACT Probing Suite Collection](https://huggingface.co/collections/ai4ce/intact-probing-suite-684e5601e9ed640fdd9b994b)**  
> Explore other variants:
>  - [Pi0 fintuned on BridgeV2](https://huggingface.co/juexzz/INTACT-pi0-finetune-bridge)
>  - [Pi0 finetuned with paraphrase on BridgeV2](https://huggingface.co/juexzz/INTACT-pi0-finetune-rephrase-bridge)

## INTACT-pi0-scratch-bridge

This repository contains a checkpoint of the Pi0 model ([HF implementation](https://huggingface.co/lerobot/pi0) | [Paper](https://arxiv.org/abs/2410.24164v1)), *initialized from PaliGemma and trained directly ("from scratch")* on the BridgeV2 dataset for robotic manipulation tasks.
The checkpoint is then evaluated in the [Simpler Environment](https://github.com/simpler-env/SimplerEnv) and with our [INTACT](https://github.com/ai4ce/INT-ACT) Probing Suite to study the generalization boundaries of VLA models.

**Paper**: [From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models](https://arxiv.org/abs/2506.09930)

## Model Details

- **Base Model**: [lerobot/pi0](https://huggingface.co/lerobot/pi0)
- **Training Dataset**: [BridgeV2](https://rail-berkeley.github.io/bridgedata/)
- **Model Type**: Vision-Language-Action (VLA) model for robotics
- **Training Method**: See our [paper](https://arxiv.org/abs/2506.09930)
- **Training Framework**: See our [repository](https://github.com/ai4ce/INT-ACT)



## Quick Start


### Usage in INTACT

```shell
git clone --recurse-submodules https://github.com/ai4ce/INT-ACT.git
cd INT-ACT
uv sync
source .venv/bin/activate
python  # run your evaluation or probing script here
```
Alternatively, you can use the checkpoint directly in Python with LeRobot, as shown below:

### Integration with LeRobot

First, install lerobot
```bash
pip install lerobot
```
Then load the policy and run inference:

```python
import torch
from lerobot.common.policies.pi0.modeling_pi0 import Pi0Policy

# Load the checkpoint from the Hugging Face Hub
policy = Pi0Policy.from_pretrained("juexzz/INTACT-pi0-scratch-bridge")

# Inference: `batch` is a dict of observations (camera image, robot state,
# and language instruction) in the format expected by LeRobot policies
with torch.no_grad():
    actions = policy.select_action(batch)
```
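
For illustration only, a minimal sketch of how such a `batch` might be assembled is shown below. The observation keys, tensor shapes, and instruction text are assumptions, not part of this repository; check the policy configuration of this checkpoint for the exact names and dimensions it expects.

```python
import torch

# Hypothetical batch layout -- key names and shapes are illustrative only;
# inspect this checkpoint's policy config for the exact ones it expects.
batch = {
    "observation.images.image": torch.zeros(1, 3, 224, 224),  # single RGB camera frame
    "observation.state": torch.zeros(1, 7),                    # proprioceptive robot state
    "task": ["put the carrot on the plate"],                   # language instruction
}

with torch.no_grad():
    action = policy.select_action(batch)  # one action step from the predicted chunk
print(action.shape)
```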


### Training Configuration
- **Training Steps**: 15 epochs (~22,695 steps)
- **Batch Size**: 1024
- **Learning Rate**: 1e-5
- **Hardware**: 4× H100/A100 GPUs
- **Input Modalities**: a single camera image (to match SimplerEnv), one language instruction, and one robot state
- **Output**: robot actions (delta EEF) with a chunk size of 4 (see the sketch below)

For more details, please refer to our [paper](https://arxiv.org/abs/2506.09930) and [code](https://github.com/ai4ce/INT-ACT).
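
As a rough illustration of the chunked output, the snippet below (reusing `policy` and `batch` from the Quick Start) repeatedly queries the policy. It assumes, as in standard LeRobot policies, that `select_action` returns one action per call from an internally queued chunk and that `policy.reset()` clears that queue; treat it as a sketch rather than a definitive interface.

```python
import torch

# Illustrative only: with a chunk size of 4, one forward pass is assumed to
# predict 4 delta-EEF actions that select_action then returns one at a time.
policy.reset()  # clear any previously queued actions
with torch.no_grad():
    for t in range(8):
        action = policy.select_action(batch)  # steps 0-3 come from the first chunk,
        print(t, action.shape)                # steps 4-7 trigger a new prediction
```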


## Evaluation

**Checkpoint choice.**
After training for 15 epochs, we sweep the checkpoints at epochs 1, 2, 3, 4, 5, 10, and 15 on the original four Bridge tasks in SimplerEnv and select the checkpoint with the *best average performance* for each of the three Pi0 variants.
You may therefore still see a higher success rate on a specific task at a different checkpoint.
For this Pi0 scratch model, the best checkpoint is at step 22695 (epoch 15).

Their performance on SimplerEnv is compared below.

### Performance Comparison on SimplerEnv

**Success rate** comparison on SimplerEnv with the other Pi0 variants and additional baselines evaluated in our INTACT suite.
For a more detailed comparison, please refer to the [paper](https://arxiv.org/abs/2506.09930).


| Model | carrot_on_plate | eggplant_in_basket | stack_cube | spoon_on_towel |
|-------|-----------------|-------------------|------------|----------------|
| [Pi0 finetune](https://huggingface.co/juexzz/INTACT-pi0-finetune-bridge) | 0.361 | 0.819 | 0.264 | 0.458 |
| [Pi0 finetune rephrase](https://huggingface.co/juexzz/INTACT-pi0-finetune-rephrase-bridge) | 0.500 | 0.944 | 0.222 | 0.597 |
| **Pi0 scratch (this model)** | 0.542 | 0.903 | 0.403 | 0.875 |
| SpatialVLA | 0.125 | 0.958 | 0.292 | 0.208 |
| Magma | 0.250 | 0.611 | 0.097 | 0.208 |
| Octo Small | 0.014 | 0.097 | 0.000 | 0.097 |
| Octo Base | 0.014 | 0.306 | 0.000 | 0.014 |




## Citation

If you use this model in your research, please cite:

```bibtex
@article{fang2025intention,
  title={From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models},
  author={Fang, Irving and Zhang, Juexiao and Tong, Shengbang and Feng, Chen},
  journal={arXiv preprint arXiv:2506.09930},
  year={2025}
}
```

## Related Work

- **Pi0 (official)**: [pi0 (JAX)](https://github.com/Physical-Intelligence/openpi)
- **Base Model (Pi0 HF)**: [lerobot/pi0](https://huggingface.co/lerobot/pi0)
- **Dataset**: [BridgeV2](https://bridge-v2.github.io/)
- **Framework**: [LeRobot](https://github.com/huggingface/lerobot)
- **Simpler Environment**: [SimplerEnv](https://github.com/simpler-env/SimplerEnv)
- **Open-source Pi0 Implementation by Allen Ren**: [open-pi-zero](https://github.com/allenzren/open-pi-zero)

## License

This model is released under the Apache 2.0 license. Please see the base model's license for any additional restrictions.

## Support

For questions about this model:
- ๐Ÿ“ง Open an issue in this repository
- ๐Ÿ’ฌ Discussion tab for community questions
- ๐Ÿ“– Check our [paper](https://arxiv.org/abs/2506.09930) for technical details

---

*Last updated: June 2025*