🏁 Best viewed with sound on

F1: A Vision Language Action Model Bridging
Understanding and Generation to Actions

Paper Code Website

🚀 Key Innovations

  • 🧠 Predictive Inverse Dynamics: Visual foresight generation for planning-based control
  • 🏗️ Mixture-of-Transformer: Three specialized experts (Understanding, Generation, Action)
  • 📈 Three-Stage Training: Progressive alignment, pretraining, and adaptation
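To make the predictive inverse dynamics idea concrete, here is a minimal toy sketch: a foresight step predicts the next observation, and an inverse dynamics step recovers the action that connects the current observation to that prediction. Everything in it is an assumption made for illustration (the `ToyForesightModel` class, the `inverse_dynamics` and `rollout` helpers, the additive dynamics, and the `max_step` parameter); it is not F1's architecture or code, where the foresight is a generated image and the experts are transformers.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class ToyForesightModel:
    """Hypothetical stand-in for a generation expert: predicts the next observation."""

    max_step: float = 0.1  # assumed limit on per-axis change in one step

    def predict(self, obs: Vector, goal: Vector) -> Vector:
        # Move each coordinate toward the goal, clipped to max_step.
        return [
            o + max(-self.max_step, min(self.max_step, g - o))
            for o, g in zip(obs, goal)
        ]


def inverse_dynamics(obs: Vector, predicted_obs: Vector) -> Vector:
    """Recover the action that takes obs to predicted_obs.

    Assumes toy dynamics next_obs = obs + action, so the action is the difference.
    """
    return [p - o for o, p in zip(obs, predicted_obs)]


def rollout(obs: Vector, goal: Vector, steps: int) -> Vector:
    """Alternate foresight prediction and inverse dynamics for a fixed number of steps."""
    model = ToyForesightModel()
    for _ in range(steps):
        foresight = model.predict(obs, goal)        # 1. imagine the next observation
        action = inverse_dynamics(obs, foresight)   # 2. infer the action that reaches it
        obs = [o + a for o, a in zip(obs, action)]  # 3. apply the action (toy dynamics)
    return obs


if __name__ == "__main__":
    print(rollout([0.0, 0.0], [0.25, -0.05], steps=3))
```

The point of the split is that the action is never predicted directly from the goal: it is derived from the pair (current observation, predicted observation), which is what distinguishes an inverse dynamics formulation from a direct observation-to-action policy.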

🤖 Real-World Robot Experiments

Diverse manipulation tasks across multiple robot platforms.

📊 Performance Summary

| Task | Platform | F1 | π0 | Improvement |
|---|---|---|---|---|
| Multi-task | Genie-1 | 82.2% | 65.2% | +17.0% |
| Adaptation | Franka | 66.7% | 53.3% | +13.4% |
| Long-horizon | ARX LIFT II | 40.0% | 0.0% | +40.0% |
| Dynamic Env | ARX LIFT II | 66.7% | 33.3% | +33.4% |
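The Improvement column is the absolute difference between the F1 and π0 columns, in percentage points rather than relative gain. A short check that recomputes it from the table values; the `results` dictionary and the `improvement_points` helper are names introduced here for illustration only.

```python
# Scores copied from the performance table, in percent: (F1, pi0 baseline).
results = {
    ("Multi-task", "Genie-1"): (82.2, 65.2),
    ("Adaptation", "Franka"): (66.7, 53.3),
    ("Long-horizon", "ARX LIFT II"): (40.0, 0.0),
    ("Dynamic Env", "ARX LIFT II"): (66.7, 33.3),
}


def improvement_points(f1_score: float, baseline_score: float) -> float:
    """Absolute improvement in percentage points, rounded to one decimal."""
    return round(f1_score - baseline_score, 1)


improvements = {
    key: improvement_points(f1_score, baseline_score)
    for key, (f1_score, baseline_score) in results.items()
}

for (task, platform), delta in improvements.items():
    print(f"{task:<13} {platform:<12} {delta:+.1f} points")
```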

Usage

Please refer to our official repo F1-VLA.

📚 Citation

If you find our work helpful, please cite:

@article{f1_vla_2025,
  title={F1: A Vision Language Action Model Bridging Understanding and Generation to Actions},
  author={Qi Lv and Weijie Kong and Hao Li and Jia Zeng and Zherui Qiu and Delin Qu and Haoming Song and Qizhi Chen and Xiang Deng and Jiangmiao Pang},
  journal={arXiv preprint arXiv:2509.06951},
  year={2025},
  url={https://arxiv.org/abs/2509.06951}
}

License

This work is licensed under CC BY-NC-SA 4.0.

Acknowledgements

This repository is based on Lerobot, Any4lerobot, and VAR.

Safetensors
Model size: 4.19B params
Tensor types: I64 · F32 · BF16