WorldVLA: Towards Autoregressive Action World Model

If our project helps you, please give us a star ⭐ on GitHub to support us. 🙏🙏


🌟 Introduction

WorldVLA is an autoregressive action world model that unifies action and image understanding and generation. WorldVLA integrates a Vision-Language-Action (VLA) model (action model) and a world model in one single framework.


Action Model Results (Text + Image -> Action)

The action model generates actions given a text instruction and image observations; a hedged usage sketch follows the example inputs below.

Input: Open the middle drawer of the cabinet.
Input: Pick up the alphabet soup and place it in the basket.
Input: Pick up the black bowl between the plate and the ramekin and place it on the plate.
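
The text + image -> action direction can be exercised with a short inference call. The sketch below is illustrative only: the `worldvla` package, `WorldVLA.from_pretrained`, and `predict_action` are assumed names, not the repository's actual API; please consult the GitHub repository for the real inference entry points.

```python
# Minimal sketch of the action-model direction (text + image -> action).
# NOTE: the `worldvla` module, `WorldVLA.from_pretrained`, and
# `predict_action` are hypothetical names used for illustration only.
from PIL import Image
from worldvla import WorldVLA  # hypothetical package name

model = WorldVLA.from_pretrained("Alibaba-DAMO-Academy/WorldVLA")

instruction = "Open the middle drawer of the cabinet."
observation = Image.open("current_frame.png")  # latest camera observation

# The action model tokenizes the instruction and the image, then
# autoregressively decodes action tokens for the robot to execute.
action = model.predict_action(instruction=instruction, image=observation)
print(action)
```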

World Model Results (Action + Image -> Image)

The world model generates the next frame given the current frame and an action; a hedged rollout sketch follows the example inputs below.

Input: Action sequence of "Open the top drawer and put the bowl inside".
Input: Action sequence of "Push the plate to the front of the stove".
Input: Action sequence of "Put the bowl on the stove".
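
The action + image -> image direction can be sketched as a rollout loop. Again, `WorldVLA.from_pretrained` and `predict_next_frame` are assumed names for illustration, not the released interface, and the example action values are placeholders.

```python
# Minimal rollout sketch of the world-model direction (action + image -> image).
# NOTE: `WorldVLA.from_pretrained` and `predict_next_frame` are hypothetical
# names used for illustration only.
from PIL import Image
from worldvla import WorldVLA  # hypothetical package name

model = WorldVLA.from_pretrained("Alibaba-DAMO-Academy/WorldVLA")

frame = Image.open("current_frame.png")  # starting observation

# Placeholder action sequence; in practice these come from the action model
# or a recorded trajectory for tasks like "Open the top drawer and put the
# bowl inside".
actions = [
    [0.02, 0.00, -0.01, 0.0, 0.0, 0.0, 1.0],
    [0.01, 0.00, -0.02, 0.0, 0.0, 0.0, 0.0],
]

# Roll the world model forward: each step conditions on the current frame
# and one action, and generates the next frame autoregressively.
for action in actions:
    frame = model.predict_next_frame(image=frame, action=action)
frame.save("predicted_frame.png")
```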

Model Zoo

Citation

If you find the project helpful for your research, please consider citing our paper:

@article{cen2025worldvla,
  title={WorldVLA: Towards Autoregressive Action World Model},
  author={Cen, Jun and Yu, Chaohui and Yuan, Hangjie and Jiang, Yuming and Huang, Siteng and Guo, Jiayan and Li, Xin and Song, Yibing and Luo, Hao and Wang, Fan and others},
  journal={arXiv preprint arXiv:2506.21539},
  year={2025}
}

Acknowledgment

This project builds upon Lumina-mGPT, Chameleon, and OpenVLA. We thank these teams for their open-source contributions.
