F1: A Vision Language Action Model Bridging Understanding and Generation to Actions

🚀 Key Innovations

🧠 Predictive Inverse Dynamics: Visual foresight generation for planning-based control
🏗️ Mixture-of-Transformer: Three specialized experts (Understanding, Generation, Action)
📈 Three-Stage Training: Progressive alignment, pretraining, and adaptation
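The predictive inverse dynamics idea in the first bullet can be illustrated with a toy control loop: first imagine a future state (foresight), then derive the action that would move the current state toward it. This is only a minimal sketch under strong assumptions, not the F1 implementation: states are short lists of floats rather than images, and `predict_foresight`, `inverse_dynamics`, `control_step`, and `rollout` are hypothetical names introduced for illustration.

```python
from typing import List

Vector = List[float]


def predict_foresight(observation: Vector, goal: Vector, step: float = 0.5) -> Vector:
    """Hypothetical stand-in for the generation step.

    Imagines the next state as a point part-way between the current
    observation and the goal. A real model would generate a future image.
    """
    return [o + step * (g - o) for o, g in zip(observation, goal)]


def inverse_dynamics(observation: Vector, foresight: Vector) -> Vector:
    """Hypothetical stand-in for the action step.

    Returns the action (here, a displacement) that moves the current
    observation onto the imagined future state.
    """
    return [f - o for o, f in zip(observation, foresight)]


def control_step(observation: Vector, goal: Vector) -> Vector:
    """One planning-based control step: foresight first, then the action."""
    foresight = predict_foresight(observation, goal)
    return inverse_dynamics(observation, foresight)


def rollout(observation: Vector, goal: Vector, steps: int = 3) -> Vector:
    """Apply control_step repeatedly in a toy world where state += action."""
    for _ in range(steps):
        action = control_step(observation, goal)
        observation = [o + a for o, a in zip(observation, action)]
    return observation
```

In this toy setting each step halves the remaining distance to the goal, which shows the separation of concerns: the foresight function decides where to go, and the inverse dynamics function decides how to get there.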

🤖 Real-World Robot Experiments

Diverse manipulation tasks across multiple robot platforms.

📊 Performance Summary

Task	Platform	F1	π0	Improvement (percentage points)
Multi-task	Genie-1	82.2%	65.2%	+17.0%
Adaptation	Franka	66.7%	53.3%	+13.4%
Long-horizon	ARX LIFT II	40.0%	0.0%	+40.0%
Dynamic Env	ARX LIFT II	66.7%	33.3%	+33.4%
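The Improvement column appears to be the absolute difference between the F1 and π0 scores, in percentage points, rather than a relative gain. A short sketch that recomputes it from the table values (the `results` list and the `improvement` helper are illustrative names, not part of the F1 codebase):

```python
# Scores copied from the table above: (task, platform, F1 %, pi0 %).
results = [
    ("Multi-task", "Genie-1", 82.2, 65.2),
    ("Adaptation", "Franka", 66.7, 53.3),
    ("Long-horizon", "ARX LIFT II", 40.0, 0.0),
    ("Dynamic Env", "ARX LIFT II", 66.7, 33.3),
]


def improvement(f1_score: float, baseline: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(f1_score - baseline, 1)


for task, platform, f1_score, pi0_score in results:
    delta = improvement(f1_score, pi0_score)
    print(f"{task}\t{platform}\t+{delta:.1f}")
```

A relative improvement would be undefined for the long-horizon row, where the baseline is 0.0%, which is one reason to read the column as an absolute difference.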

Usage

For installation and usage instructions, please refer to our official repository, F1-VLA.

📚 Citation

If you find our work helpful, please cite:

@article{f1_vla_2025,
  title={F1: A Vision Language Action Model Bridging Understanding and Generation to Actions},
  author={Qi Lv and Weijie Kong and Hao Li and Jia Zeng and Zherui Qiu and Delin Qu and Haoming Song and Qizhi Chen and Xiang Deng and Jiangmiao Pang},
  journal={arXiv preprint arXiv:2509.06951},
  year={2025},
  url={https://arxiv.org/abs/2509.06951}
}

License

This work is licensed under CC BY-NC-SA 4.0.

Acknowledgements

This repository builds on LeRobot, Any4lerobot, and VAR.
