This project presents UniFlow (Unified Pixel Flow Tokenizer), a unified continuous visual tokenizer that resolves the conflict between semantic understanding and high-fidelity reconstruction. By pairing a pretrained vision encoder with a patch-wise flow decoder, UniFlow delivers state-of-the-art reconstruction and multimodal understanding from a single-encoder tokenizer, surpassing both discrete and continuous counterparts.
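To make the decoding idea concrete, below is a toy, self-contained sketch (not UniFlow's actual code or API) of patch-wise flow decoding: each encoder token conditions an Euler integration of a flow ODE from Gaussian noise toward one pixel patch. The "velocity network" is a random linear map standing in for the learned model, and all dimensions are made-up placeholders, so the example runs without any ML framework.

```python
import random

random.seed(0)
PATCH_DIM = 12   # tiny stand-in for a flattened pixel patch
TOKEN_DIM = 8    # tiny stand-in for one encoder token

# Random linear "velocity network" weights (placeholder for the learned model).
W_x = [[random.gauss(0, 0.05) for _ in range(PATCH_DIM)] for _ in range(PATCH_DIM)]
W_c = [[random.gauss(0, 0.05) for _ in range(PATCH_DIM)] for _ in range(TOKEN_DIM)]

def matvec(W, v):
    """Row-vector times matrix: returns a vector of length len(W[0])."""
    return [sum(W[i][j] * v[i] for i in range(len(v))) for j in range(len(W[0]))]

def velocity(x, cond, t):
    """Velocity field conditioned on the current patch state and its token."""
    vx = matvec(W_x, x)
    vc = matvec(W_c, cond)
    return [a + b for a, b in zip(vx, vc)]

def decode_patch(cond, steps=8):
    """Euler-integrate the flow ODE from Gaussian noise toward a pixel patch."""
    x = [random.gauss(0, 1) for _ in range(PATCH_DIM)]
    dt = 1.0 / steps
    for i in range(steps):
        x = [xi + dt * vi for xi, vi in zip(x, velocity(x, cond, i * dt))]
    return x

# Decode one patch per (hypothetical) encoder token.
tokens = [[random.gauss(0, 1) for _ in range(TOKEN_DIM)] for _ in range(4)]
patches = [decode_patch(c) for c in tokens]
print(len(patches), len(patches[0]))  # 4 12
```

Because each patch is decoded independently given its token, this style of decoder parallelizes trivially across patches; the real model replaces the linear map with a learned network and integrates over many more steps.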
Comparison of different training paradigms for unified tokenizers. UniFlow achieves a new state-of-the-art in reconstruction fidelity and multimodal understanding, surpassing both discrete and continuous unified tokenizers, while offering high compression ratios. (All multimodal large language models are trained on LLaVA-v1.5 data with Vicuna-7B, except that TokenFlow uses Vicuna-13B.)
Various downstream tasks demonstrate UniFlow's robust visual representation.
🔥 Updates
- [2025.10.01] We are excited to release UniFlow, a powerful unified tokenizer featuring our novel Layer-wise Adaptive Distillation and a Patch-wise Pixel Flow Decoder. Code and pretrained models are now available!
Quick Start
Inference
To test each model's reconstruction, run the quick-start notebook:
jupyter notebook quick_start.ipynb
❤️ Acknowledgement
Our work builds upon the foundations laid by many excellent projects in the field. We would like to thank the authors of MAR, and we also drew inspiration from the methodologies presented in FlowMo and InternVideo2. We are grateful for their contributions to the community.