WorldVLA: Towards Autoregressive Action World Model

If our project helps you, please give us a star ⭐ on GitHub to support us. 🙏🙏


🌟 Introduction

WorldVLA is an autoregressive action world model that unifies action and image understanding and generation. WorldVLA integrates a Vision-Language-Action (VLA) model (action model) and a world model in one single framework.


Action Model Results (Text + Image -> Action)

The action model generates actions given a text instruction and image observations; a hedged usage sketch follows the example inputs below.

Input: Open the middle drawer of the cabinet.
Input: Pick up the alphabet soup and place it in the basket.
Input: Pick up the black bowl between the plate and the ramekin and place it on the plate.
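
The text + image -> action direction can be exercised with a short inference call. The sketch below is illustrative only: the `worldvla` package, `WorldVLA.from_pretrained`, and `predict_action` are assumed names, not the repository's actual API; please consult the GitHub repository for the real inference entry points.

```python
# Minimal sketch of the action-model direction (text + image -> action).
# NOTE: the `worldvla` module, `WorldVLA.from_pretrained`, and
# `predict_action` are hypothetical names used for illustration only.
from PIL import Image
from worldvla import WorldVLA  # hypothetical package name

model = WorldVLA.from_pretrained("Alibaba-DAMO-Academy/WorldVLA")

instruction = "Open the middle drawer of the cabinet."
observation = Image.open("current_frame.png")  # latest camera observation

# The action model tokenizes the instruction and the image, then
# autoregressively decodes action tokens for the robot to execute.
action = model.predict_action(instruction=instruction, image=observation)
print(action)
```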

World Model Results (Action + Image -> Image)

The world model generates the next frame given the current frame and an action; a hedged rollout sketch follows the example inputs below.

Input: Action sequence of "Open the top drawer and put the bowl inside".
Input: Action sequence of "Push the plate to the front of the stove".
Input: Action sequence of "Put the bowl on the stove".
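
The action + image -> image direction can be sketched as a rollout loop. Again, `WorldVLA.from_pretrained` and `predict_next_frame` are assumed names for illustration, not the released interface, and the example action values are placeholders.

```python
# Minimal rollout sketch of the world-model direction (action + image -> image).
# NOTE: `WorldVLA.from_pretrained` and `predict_next_frame` are hypothetical
# names used for illustration only.
from PIL import Image
from worldvla import WorldVLA  # hypothetical package name

model = WorldVLA.from_pretrained("Alibaba-DAMO-Academy/WorldVLA")

frame = Image.open("current_frame.png")  # starting observation

# Placeholder action sequence; in practice these come from the action model
# or a recorded trajectory for tasks like "Open the top drawer and put the
# bowl inside".
actions = [
    [0.02, 0.00, -0.01, 0.0, 0.0, 0.0, 1.0],
    [0.01, 0.00, -0.02, 0.0, 0.0, 0.0, 0.0],
]

# Roll the world model forward: each step conditions on the current frame
# and one action, and generates the next frame autoregressively.
for action in actions:
    frame = model.predict_next_frame(image=frame, action=action)
frame.save("predicted_frame.png")
```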

Model Zoo

Citation

If you find the project helpful for your research, please consider citing our paper:

@article{cen2025worldvla,
  title={WorldVLA: Towards Autoregressive Action World Model},
  author={Cen, Jun and Yu, Chaohui and Yuan, Hangjie and Jiang, Yuming and Huang, Siteng and Guo, Jiayan and Li, Xin and Song, Yibing and Luo, Hao and Wang, Fan and others},
  journal={arXiv preprint arXiv:2506.21539},
  year={2025}
}

Acknowledgment

This project builds upon Lumina-mGPT, Chameleon, and OpenVLA. We thank these teams for their open-source contributions.
