VLA-Adapter committed
Commit 45caa9b · verified · Parent: a04fd50

Update README.md

Files changed (1): README.md (+91 -152)
README.md CHANGED
@@ -21,6 +21,7 @@ VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model t
  - 💬 Project page: [https://vla-adapter.github.io/](https://vla-adapter.github.io/)
  - 🖥️ Dataset: [https://huggingface.co/datasets/openvla/modified_libero_rlds/tree/main](https://huggingface.co/datasets/openvla/modified_libero_rlds/tree/main)
  - 🤗 HuggingFace: [https://huggingface.co/VLA-Adapter](https://huggingface.co/VLA-Adapter)

  ## Model Details
  We have developed and released the VLA-Adapter family of VLA models, a series of fine-tuned generative
@@ -43,196 +44,134 @@ This resulted in a high-performance VLA model on a tiny-scale backbone.
  ### Success Rate Comparison
  <table>
  <tr>
- <td><strong>Category</strong></td> <td><strong>Methods</strong></td> <td><strong>Scale</strong></td> <td><strong>LIBERO-Spatial</strong></td> <td><strong>LIBERO-Object</strong></td> <td><strong>LIBERO-Goal</strong></td> <td><strong>LIBERO-Long</strong></td> <td><strong>Avg.</strong></td>
  </tr>
- <tr><td rowspan="11">Large-scale</td><td>FlowVLA (Zhong et al., 2025)</td><td>8.5B</td><td>93.2</td><td>95.0</td><td>91.6</td><td>72.6</td><td>88.1</td></tr>
- <tr><td>UnifiedVLA (Wang et al., 2025)</td><td>8.5B</td><td>95.4</td><td><i><u>98.8*</u></i></td><td>93.6</td><td>94.0</td><td>95.5</td></tr>
- <tr><td>OpenVLA (Kim et al., 2024)</td><td>7B</td><td>84.7</td><td>88.4</td><td>79.2</td><td>53.7</td><td>76.5</td></tr>
- <tr><td>OpenVLA-OFT (Kim et al., 2025)</td><td>7B</td><td><i><u>97.6*</u></i></td><td>98.4</td><td><b>97.9</b></td><td><i><u>94.5*</u></i></td><td><i><u>97.1*</u></i></td></tr>
- <tr><td>UniVLA (Bu et al., 2025)</td><td>7B</td><td>96.5</td><td>96.8</td><td>95.6</td><td>92.0</td><td>95.2</td></tr>
- <tr><td>CoT-VLA (Zhao et al., 2025)</td><td>7B</td><td>87.5</td><td>91.6</td><td>87.6</td><td>69.0</td><td>81.1</td></tr>
- <tr><td>WorldVLA (Cen et al., 2025)</td><td>7B</td><td>87.6</td><td>96.2</td><td>83.4</td><td>60.0</td><td>81.8</td></tr>
- <tr><td>TraceVLA (Zheng et al., 2025)</td><td>7B</td><td>84.6</td><td>85.2</td><td>75.1</td><td>54.1</td><td>74.8</td></tr>
- <tr><td>MolmoAct (Lee et al., 2025)</td><td>7B</td><td>87.0</td><td>95.4</td><td>87.6</td><td>77.2</td><td>86.6</td></tr>
- <tr><td>ThinkAct (Huang et al., 2025)</td><td>7B</td><td>88.3</td><td>91.4</td><td>87.1</td><td>70.9</td><td>84.4</td></tr>
- <tr><td>PD-VLA (Song et al., 2025b)</td><td>7B</td><td>95.5</td><td>96.7</td><td>94.9</td><td>91.7</td><td>94.7</td></tr>
- <tr><td rowspan="8">Small-scale</td><td>4D-VLA (Zhang et al., 2025)</td><td>4B</td><td>88.9</td><td>95.2</td><td>90.9</td><td>79.1</td><td>88.6</td></tr>
- <tr><td>SpatialVLA (Qu et al., 2025)</td><td>4B</td><td>88.2</td><td>89.9</td><td>78.6</td><td>55.5</td><td>78.1</td></tr>
- <tr><td>π0 (Black et al., 2025)</td><td>3B</td><td>96.8</td><td><i><u>98.8*</u></i></td><td>95.8</td><td>85.2</td><td>94.2</td></tr>
- <tr><td>π0-FAST (Pertsch et al., 2025)</td><td>3B</td><td>96.4</td><td>96.8</td><td>88.6</td><td>60.2</td><td>85.5</td></tr>
- <tr><td>NORA (Hung et al., 2025)</td><td>3B</td><td>92.2</td><td>95.4</td><td>89.4</td><td>74.6</td><td>87.9</td></tr>
- <tr><td>SmolVLA (Shukor et al., 2025)</td><td>2.2B</td><td>93.0</td><td>94.0</td><td>91.0</td><td>77.0</td><td>88.8</td></tr>
- <tr><td>GR00T N1 (NVIDIA et al., 2025)</td><td>2B</td><td>94.4</td><td>97.6</td><td>93.0</td><td>90.6</td><td>93.9</td></tr>
- <tr><td>GraspVLA (Deng et al., 2025)</td><td>1.8B</td><td>-</td><td>94.1</td><td>91.2</td><td>82.0</td><td>89.1</td></tr>
- <tr><td rowspan="4">Tiny-scale</td><td>Seer (Tian et al., 2025)</td><td>0.57B</td><td>-</td><td>-</td><td>-</td><td>78.7</td><td>78.7</td></tr>
- <tr><td>VLA-OS (Gao et al., 2025)</td><td>0.5B</td><td>87.0</td><td>96.5</td><td>92.7</td><td>66.0</td><td>85.6</td></tr>
- <tr><td>Diffusion Policy (Chi et al., 2023)</td><td>-</td><td>78.3</td><td>92.5</td><td>68.3</td><td>50.5</td><td>72.4</td></tr>
- <tr><td><b>VLA-Adapter (Ours)</b></td><td><b>0.5B</b></td><td><b>97.8</b></td><td><b>99.2</b></td><td><i><u>97.2*</u></i></td><td><b>95.0</b></td><td><b>97.3</b></td></tr>
  </table>

- ### Effectiveness Comparison

  <table>
  <tr>
- <td></td> <td><strong>OpenVLA-OFT</strong></td> <td><strong>VLA-Adapter</strong></td> <td></td>
  </tr>
- <tr><td>Backbone</td><td>7B</td><td><strong>0.5B</strong></td><td>1/14×</td></tr>
- <tr><td>Fine-Tuning Cost</td><td>304 GPU·h</td><td><strong>8 GPU·h</strong></td><td>1/38×</td></tr>
- <tr><td>Training VRAM (8 batch)</td><td>62 GB</td><td><strong>24.7 GB</strong></td><td>0.4×</td></tr>
- <tr><td>Throughput (8 chunk)</td><td>71.4 Hz</td><td><strong>219.2 Hz</strong></td><td>3×</td></tr>
- <tr><td>Performance</td><td>97.1%</td><td><strong>97.3%</strong></td><td>Maintain</td></tr>
  </table>

  ## Citation instructions

  ```BibTeX
- @article{Wang2025VLAAdapter,
-   author  = {Wang, Yihao and Ding, Pengxiang and Li, Lingxiao and Cui, Can and Ge, Zirui and Tong, Xinyang and Song, Wenxuan and Zhao, Han and Zhao, Wei and Hou, Pengxu and Huang, Siteng and Tang, Yifan and Wang, Wenhui and Zhang, Ru and Liu, Jianyi and Wang, Donglin},
-   title   = {VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model},
-   journal = {ArXiv},
-   year    = {2025}
  }
  ```
 
21
  - 💬 Project page: [https://vla-adapter.github.io/](https://vla-adapter.github.io/)
  - 🖥️ Dataset: [https://huggingface.co/datasets/openvla/modified_libero_rlds/tree/main](https://huggingface.co/datasets/openvla/modified_libero_rlds/tree/main)
  - 🤗 HuggingFace: [https://huggingface.co/VLA-Adapter](https://huggingface.co/VLA-Adapter)
+ - Github: [https://github.com/OpenHelix-Team/VLA-Adapter](https://github.com/OpenHelix-Team/VLA-Adapter)
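
The dataset and checkpoints linked above are hosted on the Hugging Face Hub. As a minimal sketch (editorial addition, not part of the committed README), assuming `huggingface_hub` is installed, the modified LIBERO RLDS dataset can be pulled with `snapshot_download`; the exact checkpoint repository names under the VLA-Adapter organization are not listed here, so that line is left as a commented placeholder:

```python
# Minimal sketch (editorial addition): download the modified LIBERO RLDS
# dataset referenced above. Assumes `pip install huggingface_hub`.
from huggingface_hub import snapshot_download

dataset_dir = snapshot_download(
    repo_id="openvla/modified_libero_rlds",  # dataset repo taken from the link above
    repo_type="dataset",
)
print("LIBERO RLDS downloaded to:", dataset_dir)

# Checkpoints live under the VLA-Adapter organization on the Hub; substitute a
# concrete repo id (hypothetical placeholder below) before running this line.
# model_dir = snapshot_download(repo_id="VLA-Adapter/<model-repo-name>")
```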

  ## Model Details
  We have developed and released the VLA-Adapter family of VLA models, a series of fine-tuned generative
  ### Success Rate Comparison
  <table>
  <tr>
+ <td><strong>LIBERO</strong></td> <td><strong>Methods</strong></td> <td><strong>Scale</strong></td> <td><strong>Spatial</strong></td> <td><strong>Object</strong></td> <td><strong>Goal</strong></td> <td><strong>Long</strong></td> <td><strong>Avg.</strong></td>
  </tr>
+ <tr><td rowspan="10">Large-scale</td><td>FlowVLA (Zhong et al., 2025)</td><td>8.5B</td><td>93.2</td><td>95.0</td><td>91.6</td><td>72.6</td><td>88.1</td></tr>
+ <tr><td>UnifiedVLA (Wang et al., 2025)</td><td>8.5B</td><td>95.4</td><td><i><u>98.8*</u></i></td><td>93.6</td><td>94.0</td><td>95.5</td></tr>
+ <tr><td>OpenVLA (Kim et al., 2024)</td><td>7B</td><td>84.7</td><td>88.4</td><td>79.2</td><td>53.7</td><td>76.5</td></tr>
+ <tr><td>OpenVLA-OFT (Kim et al., 2025)</td><td>7B</td><td><i><u>97.6*</u></i></td><td>98.4</td><td><b>97.9</b></td><td><i><u>94.5*</u></i></td><td><i><u>97.1*</u></i></td></tr>
+ <tr><td>UniVLA (Bu et al., 2025)</td><td>7B</td><td>96.5</td><td>96.8</td><td>95.6</td><td>92.0</td><td>95.2</td></tr>
+ <tr><td>CoT-VLA (Zhao et al., 2025)</td><td>7B</td><td>87.5</td><td>91.6</td><td>87.6</td><td>69.0</td><td>81.1</td></tr>
+ <tr><td>WorldVLA (Cen et al., 2025)</td><td>7B</td><td>87.6</td><td>96.2</td><td>83.4</td><td>60.0</td><td>81.8</td></tr>
+ <tr><td>TraceVLA (Zheng et al., 2025)</td><td>7B</td><td>84.6</td><td>85.2</td><td>75.1</td><td>54.1</td><td>74.8</td></tr>
+ <tr><td>MolmoAct (Lee et al., 2025)</td><td>7B</td><td>87.0</td><td>95.4</td><td>87.6</td><td>77.2</td><td>86.6</td></tr>
+ <tr><td>ThinkAct (Huang et al., 2025)</td><td>7B</td><td>88.3</td><td>91.4</td><td>87.1</td><td>70.9</td><td>84.4</td></tr>
+ <tr><td rowspan="7">Small-scale</td><td>4D-VLA (Zhang et al., 2025)</td><td>4B</td><td>88.9</td><td>95.2</td><td>90.9</td><td>79.1</td><td>88.6</td></tr>
+ <tr><td>SpatialVLA (Qu et al., 2025)</td><td>4B</td><td>88.2</td><td>89.9</td><td>78.6</td><td>55.5</td><td>78.1</td></tr>
+ <tr><td>π0 (Black et al., 2024)</td><td>3B</td><td>96.8</td><td><i><u>98.8*</u></i></td><td>95.8</td><td>85.2</td><td>94.2</td></tr>
+ <tr><td>π0-FAST (Pertsch et al., 2025)</td><td>3B</td><td>96.4</td><td>96.8</td><td>88.6</td><td>60.2</td><td>85.5</td></tr>
+ <tr><td>NORA (Hung et al., 2025)</td><td>3B</td><td>92.2</td><td>95.4</td><td>89.4</td><td>74.6</td><td>87.9</td></tr>
+ <tr><td>SmolVLA (Shukor et al., 2025)</td><td>2.2B</td><td>93.0</td><td>94.0</td><td>91.0</td><td>77.0</td><td>88.8</td></tr>
+ <tr><td>GR00T N1 (NVIDIA et al., 2025)</td><td>2B</td><td>94.4</td><td>97.6</td><td>93.0</td><td>90.6</td><td>93.9</td></tr>
+ <tr><td rowspan="5">Tiny-scale</td><td>Seer (Tian et al., 2025)</td><td>0.57B</td><td>-</td><td>-</td><td>-</td><td>78.7</td><td>78.7</td></tr>
+ <tr><td>VLA-OS (Gao et al., 2025)</td><td>0.5B</td><td>87.0</td><td>96.5</td><td>92.7</td><td>66.0</td><td>85.6</td></tr>
+ <tr><td>Diffusion Policy (Chi et al., 2023)</td><td>-</td><td>78.3</td><td>92.5</td><td>68.3</td><td>50.5</td><td>72.4</td></tr>
+ <tr><td><b>VLA-Adapter (Ours)</b></td><td><b>0.5B</b></td><td><b>97.8</b></td><td><b>99.2</b></td><td><i><u>97.2*</u></i></td><td><b>95.0</b></td><td><b>97.3</b></td></tr>
+ <tr><td><b>VLA-Adapter-Pro (Ours)</b></td><td><b>0.5B</b></td><td><b><i>99.6</i></b></td><td><b><i>99.6</i></b></td><td><b><i>98.2</i></b></td><td><b><i>96.4</i></b></td><td><b><i>98.5</i></b></td></tr>
  </table>
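
In the LIBERO table above, each suite column is a success rate in percent, and Avg. appears to be the arithmetic mean of the four suites. A quick consistency check on the VLA-Adapter row (editorial sketch, values copied from the table):

```python
# Editorial consistency check: LIBERO "Avg." as the mean of the four suite rates (%).
spatial, objects, goal, long_horizon = 97.8, 99.2, 97.2, 95.0  # VLA-Adapter row
print(round((spatial + objects + goal + long_horizon) / 4, 1))  # -> 97.3, matching Avg.
```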

  <table>
  <tr>
+ <td><strong>CALVIN</strong></td> <td><strong>Methods</strong></td> <td><strong>Scale</strong></td> <td><strong>1</strong></td> <td><strong>2</strong></td> <td><strong>3</strong></td> <td><strong>4</strong></td> <td><strong>5</strong></td> <td><strong>Avg. len</strong></td>
  </tr>
+ <tr><td rowspan="8">Large-scale</td><td>UniVLA (Bu et al., 2025)</td><td>7B</td><td>95.5</td><td>85.8</td><td>75.4</td><td>66.9</td><td>56.5</td><td>3.80</td></tr>
+ <tr><td>OpenVLA (Kim et al., 2024)</td><td>7B</td><td>91.3</td><td>77.8</td><td>62.0</td><td>52.1</td><td>43.5</td><td>3.27</td></tr>
+ <tr><td>OpenVLA-OFT (Kim et al., 2025)</td><td>7B</td><td>96.3</td><td>89.1</td><td>82.4</td><td>75.8</td><td>66.5</td><td>4.10</td></tr>
+ <tr><td>VLAS (Zhao et al., 2025b)</td><td>7B</td><td>87.2</td><td>64.2</td><td>40.9</td><td>28.1</td><td>19.6</td><td>2.40</td></tr>
+ <tr><td>LCB (Shentu et al., 2024)</td><td>7B</td><td>73.6</td><td>50.2</td><td>28.5</td><td>16.0</td><td>9.9</td><td>1.78</td></tr>
+ <tr><td>RoboDual (Bu et al., 2024a)</td><td>7B</td><td>94.4</td><td>82.7</td><td>72.1</td><td>62.4</td><td>54.4</td><td>3.66</td></tr>
+ <tr><td>OpenHelix (Cui et al., 2025)</td><td>7B</td><td><i><u>97.1*</u></i></td><td>91.4</td><td>82.8</td><td>72.6</td><td>64.1</td><td>4.08</td></tr>
+ <tr><td>ReconVLA (Song et al., 2025c)</td><td>7B</td><td>95.6</td><td>87.6</td><td>76.9</td><td>69.3</td><td>64.1</td><td>3.95</td></tr>
+ <tr><td rowspan="4">Small-scale</td><td>DeeR (Yue et al., 2024)</td><td>3B</td><td>86.2</td><td>70.1</td><td>51.8</td><td>41.5</td><td>30.4</td><td>2.82</td></tr>
+ <tr><td>RoboFlamingo (Li et al., 2024b)</td><td>3B</td><td>82.4</td><td>61.9</td><td>46.6</td><td>33.1</td><td>23.5</td><td>2.48</td></tr>
+ <tr><td>VPP (Hu et al., 2025)</td><td>1.5B</td><td>95.7</td><td>91.2</td><td><i><u>86.3*</u></i></td><td><i><u>81.0*</u></i></td><td><i><u>75.0*</u></i></td><td><i><u>4.33*</u></i></td></tr>
+ <tr><td>SuSIE (Black et al., 2024)</td><td>1.3B</td><td>87.0</td><td>69.0</td><td>49.0</td><td>38.0</td><td>26.0</td><td>2.69</td></tr>
+ <tr><td rowspan="5">Tiny-scale</td><td>Seer-Large (Tian et al., 2025)</td><td>0.57B</td><td>96.3</td><td><i><u>91.6*</u></i></td><td>86.1</td><td>80.3</td><td>74.0</td><td>4.28</td></tr>
+ <tr><td>MoDE (Reuss et al., 2025)</td><td>0.44B</td><td>96.2</td><td>88.9</td><td>81.1</td><td>71.8</td><td>63.5</td><td>4.01</td></tr>
+ <tr><td>Seer (Tian et al., 2025)</td><td>0.32B</td><td>94.4</td><td>87.2</td><td>79.9</td><td>72.2</td><td>64.3</td><td>3.98</td></tr>
+ <tr><td><b>VLA-Adapter (Ours)</b></td><td><b>0.5B</b></td><td><b><i>99.1</i></b></td><td><b>94.6</b></td><td><b>88.8</b></td><td><b>82.8</b></td><td><b>76.5</b></td><td><b>4.42</b></td></tr>
+ <tr><td><b>VLA-Adapter-Pro (Ours)</b></td><td><b>0.5B</b></td><td><b>98.5</b></td><td><b><i>95.0</i></b></td><td><b><i>90.5</i></b></td><td><b><i>85.3</i></b></td><td><b><i>80.0</i></b></td><td><b><i>4.50</i></b></td></tr>
  </table>
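
For the CALVIN table, columns 1 to 5 are the success rates (%) of completing that many instructions in a row, and Avg. len is the average number of consecutively completed tasks. Up to rounding of the reported rates, it matches the sum of the five rates expressed as fractions; a quick check on the VLA-Adapter row (editorial sketch, values copied from the table):

```python
# Editorial consistency check: CALVIN "Avg. len" ~ sum of per-horizon success rates.
rates = [99.1, 94.6, 88.8, 82.8, 76.5]        # VLA-Adapter, horizons 1..5, in %
print(round(sum(r / 100 for r in rates), 2))  # -> 4.42, matching Avg. len
```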

  ## Citation instructions

  ```BibTeX
+ @article{wang2025vlaadapter,
+   author  = {Wang, Yihao and Ding, Pengxiang and Li, Lingxiao and Cui, Can and Ge, Zirui and Tong, Xinyang and Song, Wenxuan and Zhao, Han and Zhao, Wei and Hou, Pengxu and Huang, Siteng and Tang, Yifan and Wang, Wenhui and Zhang, Ru and Liu, Jianyi and Wang, Donglin},
+   title   = {VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model},
+   journal = {arXiv preprint arXiv:2509.09372},
+   year    = {2025}
  }
  ```