SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Abstract
SmolVLA is a compact, efficient vision-language-action model that achieves competitive performance at reduced computational costs and can be deployed on consumer-grade hardware.
Vision-language models (VLMs) pretrained on large-scale multimodal datasets encode rich visual and linguistic knowledge, making them a strong foundation for robotics. Rather than training robotic policies from scratch, recent approaches adapt VLMs into vision-language-action (VLA) models that enable natural language-driven perception and control. However, existing VLAs are typically massive--often with billions of parameters--leading to high training costs and limited real-world deployability. Moreover, they rely on academic and industrial datasets, overlooking the growing availability of community-collected data from affordable robotic platforms. In this work, we present SmolVLA, a small, efficient, and community-driven VLA that drastically reduces both training and inference costs while retaining competitive performance. SmolVLA is designed to be trained on a single GPU and deployed on consumer-grade GPUs or even CPUs. To further improve responsiveness, we introduce an asynchronous inference stack that decouples perception and action prediction from action execution, allowing higher control rates with chunked action generation. Despite its compact size, SmolVLA achieves performance comparable to VLAs that are 10x larger. We evaluate SmolVLA on a range of both simulated and real-world robotic benchmarks and release all code, pretrained models, and training data.
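To make the asynchronous inference idea more concrete, below is a minimal, hypothetical sketch (not the actual LeRobot implementation): a background thread runs perception and chunked action prediction while the control loop executes the previous chunk at a fixed rate, so control never stalls on inference latency. `DummyPolicy` and `DummyRobot` are placeholder stand-ins for the real policy and robot interfaces.

```python
import queue
import threading
import time

# Hypothetical stand-ins; the real SmolVLA / LeRobot APIs may differ.
class DummyPolicy:
    def predict_chunk(self, obs, instruction, horizon=50):
        # Pretend to predict a chunk of `horizon` future actions from one observation.
        return [f"action_{i}" for i in range(horizon)]

class DummyRobot:
    def get_observation(self):
        return {"image": None, "state": None}   # camera frames + proprioception
    def send_action(self, action):
        pass

# Small queue: the producer blocks if it gets too far ahead of execution.
action_queue: queue.Queue = queue.Queue(maxsize=2)

def inference_loop(policy, robot, instruction):
    """Producer: perception + chunked action prediction, off the control loop."""
    while True:
        obs = robot.get_observation()
        action_queue.put(policy.predict_chunk(obs, instruction))

def control_loop(robot, control_hz=30):
    """Consumer: execute actions at the control rate while the next chunk is computed."""
    while True:
        for action in action_queue.get():
            robot.send_action(action)
            time.sleep(1.0 / control_hz)

policy, robot = DummyPolicy(), DummyRobot()
threading.Thread(
    target=inference_loop, args=(policy, robot, "pick up the cube"), daemon=True
).start()
control_loop(robot)
```

Because prediction can run on a separate (possibly remote) machine, the same pattern also allows offloading inference to a server while the robot keeps executing the current chunk locally.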
Community
🥰 thank you so much! 🤗
The paper states that the model is trained on 4 GPUs, corresponding to 30k GPU-hours, but that works out to 30k/24/4 ≈ 312 days. Is the number correct?
I asked the author the same question.
It's the project-wide total, which accounts for 100+ models trained during architecture tweaking, hyperparameter tuning, ablations, and of course testing.
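For a rough sense of scale, splitting that project total across the 100+ runs mentioned above (the exact per-run breakdown is not stated in the paper) gives a much more modest per-run cost:

```python
# Back-of-the-envelope split; assumed figures, not from the paper.
total_gpu_hours = 30_000   # project-wide total
num_runs = 100             # ablations, sweeps, tests mentioned above
gpus_per_run = 4

gpu_hours_per_run = total_gpu_hours / num_runs            # ~300 GPU-hours
wall_clock_days = gpu_hours_per_run / gpus_per_run / 24   # ~3.1 days per run
print(gpu_hours_per_run, round(wall_clock_days, 1))
```

So a single training run on 4 GPUs is on the order of a few days, not ~312.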
Especially love the async inference contributions here. After trying to run Gr00t on a cloud GPU a few weeks back and experiencing the network latencies significantly impacting performance, I really appreciate the idea of parallelising inference with action execution.
I hope we see other VLAs adopting this architecture, it feels like a key step toward robots sharing cloud GPUs rather than depending on local hardware (reducing marginal cost & increasing maintainability!).
Hey
@willnorris
thank you so much for your words---we're glad you liked the report, and async inference 😉
We're hard at work to make sure the stack lands on main soon. It's already compatible with all the policy types LeRobot supports, and open-sourcing everything is our effort to make this the standard paradigm for the community. Why lagging? 🤓
If you're interested in following progress, check the PR here 🔗 https://github.com/huggingface/lerobot/pull/1196
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models (2025)
- NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks (2025)
- Interactive Post-Training for Vision-Language-Action Models (2025)
- ReFineVLA: Reasoning-Aware Teacher-Guided Transfer Fine-Tuning (2025)
- ChatVLA-2: Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge (2025)
- VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning (2025)
- From Grounding to Manipulation: Case Studies of Foundation Model Integration in Embodied Robotic Systems (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend