# VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model
### Effectiveness Comparison

<table>
  <tr>
    <td>Backbone</td> <td>7B</td> <td><strong>0.5B</strong></td> <td>1/14×</td>
  </tr>
  <tr>
    <td>Fine-Tuning Cost</td> <td>304 GPU·h</td> <td><strong>8 GPU·h</strong></td> <td>1/38×</td>
  </tr>
  <tr>
    <td>Training VRAM (batch size 8)</td> <td>62 GB</td> <td><strong>24.7 GB</strong></td> <td>0.4×</td>
  </tr>
  <tr>
    <td>Throughput (chunk size 8)</td> <td>71.4 Hz</td> <td><strong>219.2 Hz</strong></td> <td>3×</td>
  </tr>
</table>
- 💬 Project page: [https://vla-adapter.github.io/](https://vla-adapter.github.io/)
- 🖥️ Dataset: [https://huggingface.co/datasets/openvla/modified_libero_rlds/tree/main](https://huggingface.co/datasets/openvla/modified_libero_rlds/tree/main)
- 🤗 HuggingFace: [https://huggingface.co/VLA-Adapter](https://huggingface.co/VLA-Adapter)
- GitHub: [https://github.com/OpenHelix-Team/VLA-Adapter](https://github.com/OpenHelix-Team/VLA-Adapter)

## Model Details

We have developed and released the VLA-Adapter family of VLA models, a series of fine-tuned generative

### Success Rate Comparison

<table>
  <tr>
    <td><strong>LIBERO</strong></td> <td><strong>Methods</strong></td> <td><strong>Scale</strong></td> <td><strong>Spatial</strong></td> <td><strong>Object</strong></td> <td><strong>Goal</strong></td> <td><strong>Long</strong></td> <td><strong>Avg.</strong></td>
  </tr>
  <tr><td rowspan="10">Large-scale</td><td>FlowVLA (Zhong et al., 2025)</td><td>8.5B</td><td>93.2</td><td>95.0</td><td>91.6</td><td>72.6</td><td>88.1</td></tr>
  <tr><td>UnifiedVLA (Wang et al., 2025)</td><td>8.5B</td><td>95.4</td><td><i><u>98.8*</u></i></td><td>93.6</td><td>94.0</td><td>95.5</td></tr>
  <tr><td>OpenVLA (Kim et al., 2024)</td><td>7B</td><td>84.7</td><td>88.4</td><td>79.2</td><td>53.7</td><td>76.5</td></tr>
  <tr><td>OpenVLA-OFT (Kim et al., 2025)</td><td>7B</td><td><i><u>97.6*</u></i></td><td>98.4</td><td><b>97.9</b></td><td><i><u>94.5*</u></i></td><td><i><u>97.1*</u></i></td></tr>
  <tr><td>UniVLA (Bu et al., 2025)</td><td>7B</td><td>96.5</td><td>96.8</td><td>95.6</td><td>92.0</td><td>95.2</td></tr>
  <tr><td>CoT-VLA (Zhao et al., 2025)</td><td>7B</td><td>87.5</td><td>91.6</td><td>87.6</td><td>69.0</td><td>81.1</td></tr>
  <tr><td>WorldVLA (Cen et al., 2025)</td><td>7B</td><td>87.6</td><td>96.2</td><td>83.4</td><td>60.0</td><td>81.8</td></tr>
  <tr><td>TraceVLA (Zheng et al., 2025)</td><td>7B</td><td>84.6</td><td>85.2</td><td>75.1</td><td>54.1</td><td>74.8</td></tr>
  <tr><td>MolmoAct (Lee et al., 2025)</td><td>7B</td><td>87.0</td><td>95.4</td><td>87.6</td><td>77.2</td><td>86.6</td></tr>
  <tr><td>ThinkAct (Huang et al., 2025)</td><td>7B</td><td>88.3</td><td>91.4</td><td>87.1</td><td>70.9</td><td>84.4</td></tr>
  <tr><td rowspan="7">Small-scale</td><td>4D-VLA (Zhang et al., 2025)</td><td>4B</td><td>88.9</td><td>95.2</td><td>90.9</td><td>79.1</td><td>88.6</td></tr>
  <tr><td>SpatialVLA (Qu et al., 2025)</td><td>4B</td><td>88.2</td><td>89.9</td><td>78.6</td><td>55.5</td><td>78.1</td></tr>
  <tr><td>π0 (Black et al., 2024)</td><td>3B</td><td>96.8</td><td><i><u>98.8*</u></i></td><td>95.8</td><td>85.2</td><td>94.2</td></tr>
  <tr><td>π0-FAST (Pertsch et al., 2025)</td><td>3B</td><td>96.4</td><td>96.8</td><td>88.6</td><td>60.2</td><td>85.5</td></tr>
  <tr><td>NORA (Hung et al., 2025)</td><td>3B</td><td>92.2</td><td>95.4</td><td>89.4</td><td>74.6</td><td>87.9</td></tr>
  <tr><td>SmolVLA (Shukor et al., 2025)</td><td>2.2B</td><td>93.0</td><td>94.0</td><td>91.0</td><td>77.0</td><td>88.8</td></tr>
  <tr><td>GR00T N1 (NVIDIA et al., 2025)</td><td>2B</td><td>94.4</td><td>97.6</td><td>93.0</td><td>90.6</td><td>93.9</td></tr>
  <tr><td rowspan="5">Tiny-scale</td><td>Seer (Tian et al., 2025)</td><td>0.57B</td><td>-</td><td>-</td><td>-</td><td>78.7</td><td>78.7</td></tr>
  <tr><td>VLA-OS (Gao et al., 2025)</td><td>0.5B</td><td>87.0</td><td>96.5</td><td>92.7</td><td>66.0</td><td>85.6</td></tr>
  <tr><td>Diffusion Policy (Chi et al., 2023)</td><td>-</td><td>78.3</td><td>92.5</td><td>68.3</td><td>50.5</td><td>72.4</td></tr>
  <tr><td><b>VLA-Adapter (Ours)</b></td><td><b>0.5B</b></td><td><b>97.8</b></td><td><b>99.2</b></td><td><i><u>97.2*</u></i></td><td><b>95.0</b></td><td><b>97.3</b></td></tr>
  <tr><td><b>VLA-Adapter-Pro (Ours)</b></td><td><b>0.5B</b></td><td><b><i>99.6</i></b></td><td><b><i>99.6</i></b></td><td><b><i>98.2</i></b></td><td><b><i>96.4</i></b></td><td><b><i>98.5</i></b></td></tr>
</table>
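As a quick sanity check on the LIBERO table, the Avg. column for most rows is the arithmetic mean of the four suite success rates (a few baselines report averages from their own papers that differ slightly). A minimal sketch, not part of the VLA-Adapter repo:

```python
# Hypothetical sanity check: for the VLA-Adapter row, the Avg. column is
# the arithmetic mean of the four LIBERO suite success rates.
suites = {"Spatial": 97.8, "Object": 99.2, "Goal": 97.2, "Long": 95.0}

avg = sum(suites.values()) / len(suites)
print(f"LIBERO Avg. = {avg:.1f}")  # 97.3, matching the table
```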

<table>
  <tr>
    <td><strong>CALVIN</strong></td> <td><strong>Methods</strong></td> <td><strong>Scale</strong></td> <td><strong>1</strong></td> <td><strong>2</strong></td> <td><strong>3</strong></td> <td><strong>4</strong></td> <td><strong>5</strong></td> <td><strong>Avg. len</strong></td>
  </tr>
  <tr><td rowspan="8">Large-scale</td><td>UniVLA (Bu et al., 2025)</td><td>7B</td><td>95.5</td><td>85.8</td><td>75.4</td><td>66.9</td><td>56.5</td><td>3.80</td></tr>
  <tr><td>OpenVLA (Kim et al., 2024)</td><td>7B</td><td>91.3</td><td>77.8</td><td>62.0</td><td>52.1</td><td>43.5</td><td>3.27</td></tr>
  <tr><td>OpenVLA-OFT (Kim et al., 2025)</td><td>7B</td><td>96.3</td><td>89.1</td><td>82.4</td><td>75.8</td><td>66.5</td><td>4.10</td></tr>
  <tr><td>VLAS (Zhao et al., 2025b)</td><td>7B</td><td>87.2</td><td>64.2</td><td>40.9</td><td>28.1</td><td>19.6</td><td>2.40</td></tr>
  <tr><td>LCB (Shentu et al., 2024)</td><td>7B</td><td>73.6</td><td>50.2</td><td>28.5</td><td>16.0</td><td>9.9</td><td>1.78</td></tr>
  <tr><td>RoboDual (Bu et al., 2024a)</td><td>7B</td><td>94.4</td><td>82.7</td><td>72.1</td><td>62.4</td><td>54.4</td><td>3.66</td></tr>
  <tr><td>OpenHelix (Cui et al., 2025)</td><td>7B</td><td><i><u>97.1*</u></i></td><td>91.4</td><td>82.8</td><td>72.6</td><td>64.1</td><td>4.08</td></tr>
  <tr><td>ReconVLA (Song et al., 2025c)</td><td>7B</td><td>95.6</td><td>87.6</td><td>76.9</td><td>69.3</td><td>64.1</td><td>3.95</td></tr>
  <tr><td rowspan="4">Small-scale</td><td>DeeR (Yue et al., 2024)</td><td>3B</td><td>86.2</td><td>70.1</td><td>51.8</td><td>41.5</td><td>30.4</td><td>2.82</td></tr>
  <tr><td>RoboFlamingo (Li et al., 2024b)</td><td>3B</td><td>82.4</td><td>61.9</td><td>46.6</td><td>33.1</td><td>23.5</td><td>2.48</td></tr>
  <tr><td>VPP (Hu et al., 2025)</td><td>1.5B</td><td>95.7</td><td>91.2</td><td><i><u>86.3*</u></i></td><td><i><u>81.0*</u></i></td><td><i><u>75.0*</u></i></td><td><i><u>4.33*</u></i></td></tr>
  <tr><td>SuSIE (Black et al., 2024)</td><td>1.3B</td><td>87.0</td><td>69.0</td><td>49.0</td><td>38.0</td><td>26.0</td><td>2.69</td></tr>
  <tr><td rowspan="5">Tiny-scale</td><td>Seer-Large (Tian et al., 2025)</td><td>0.57B</td><td>96.3</td><td><i><u>91.6*</u></i></td><td>86.1</td><td>80.3</td><td>74.0</td><td>4.28</td></tr>
  <tr><td>MoDE (Reuss et al., 2025)</td><td>0.44B</td><td>96.2</td><td>88.9</td><td>81.1</td><td>71.8</td><td>63.5</td><td>4.01</td></tr>
  <tr><td>Seer (Tian et al., 2025)</td><td>0.32B</td><td>94.4</td><td>87.2</td><td>79.9</td><td>72.2</td><td>64.3</td><td>3.98</td></tr>
  <tr><td><b>VLA-Adapter (Ours)</b></td><td><b>0.5B</b></td><td><b><i>99.1</i></b></td><td><b>94.6</b></td><td><b>88.8</b></td><td><b>82.8</b></td><td><b>76.5</b></td><td><b>4.42</b></td></tr>
  <tr><td><b>VLA-Adapter-Pro (Ours)</b></td><td><b>0.5B</b></td><td><b>98.5</b></td><td><b><i>95.0</i></b></td><td><b><i>90.5</i></b></td><td><b><i>85.3</i></b></td><td><b><i>80.0</i></b></td><td><b><i>4.50</i></b></td></tr>
</table>
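In the CALVIN table, columns 1–5 are the success rates (%) of completing 1 through 5 consecutive language instructions, and Avg. len is the standard CALVIN metric: the expected number of tasks completed in a 5-task chain, i.e. the sum of the five rates as fractions. A minimal sketch (hypothetical helper, not from the repo):

```python
# Hypothetical sanity check: CALVIN's "Avg. len" is the expected number of
# consecutively completed tasks, i.e. the sum of the chain success rates.
rates_pct = [99.1, 94.6, 88.8, 82.8, 76.5]  # VLA-Adapter row, steps 1..5

avg_len = sum(r / 100 for r in rates_pct)
print(f"Avg. len = {avg_len:.2f}")  # 4.42, matching the table
```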

## Citation instructions

```BibTeX
@article{wang2025vlaadapter,
  author={Wang, Yihao and Ding, Pengxiang and Li, Lingxiao and Cui, Can and Ge, Zirui and Tong, Xinyang and Song, Wenxuan and Zhao, Han and Zhao, Wei and Hou, Pengxu and Huang, Siteng and Tang, Yifan and Wang, Wenhui and Zhang, Ru and Liu, Jianyi and Wang, Donglin},
  title={VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model},
  journal={arXiv preprint arXiv:2509.09372},
  year={2025}
}
```