---
license: apache-2.0
datasets:
- mlfoundations/datacomp_1b
- kakaobrain/coyo-700m
- laion/laion400m
language:
- en
- zh
metrics:
- accuracy
- recall
pipeline_tag: feature-extraction
library_name: transformers
---

# UniViTAR: Unified Vision Transformer with Native Resolution

## 🌠 Introduction

We present **UniViTAR**, a family of homogeneous vision foundation models tailored **for unified visual modalities and native-resolution scenarios** in the multimodal era. We train the UniViTAR family at multiple model scales from **0.3B to 1.4B** parameters exclusively on publicly accessible image-caption data (14.6B samples seen), and observe that performance consistently improves with parameter scaling. UniViTAR is a Transformer-based encoder that inherits the overall architecture of the conventional Vision Transformer while incorporating the following modifications: *unified patchify for native image and video modalities, 2D RoPE, SwiGLU, RMSNorm, and QK-Norm*.

## 🛠️ Environment

```bash
conda create -n univitar python=3.11 -y
conda activate univitar
pip3 install einops==0.8.0 ninja==1.11.1.1 numpy==1.26.4 pillow==10.4.0 psutil==6.0.0 \
    torch==2.2.2 torchvision==0.17.2 transformers==4.49.0 timm==1.0.14
pip3 install flash-attn==2.6.3
```

## 🗝️ Model Usage

```python
import torch
import numpy as np
from PIL import Image
from modeling_univitar import UniViTARVisionModel

# Prepare the model
model = UniViTARVisionModel("config.json")
_ = model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
model = model.to(torch.bfloat16).cuda()

# Prepare the data: patchify each image at its native resolution,
# [(3, H1, W1), ..., (3, Hn, Wn)] --> (N1+...+Nn, P)
images = [Image.open("xx1.jpg"), Image.open("xx2.jpg")]
data_inputs, grid_shapes = [], []
for image in images:
    data_item = model.image_transform(image)
    input_data, grid_shape = model.data_patchify(data_item)
    data_inputs.append(input_data.to(torch.bfloat16).cuda())
    grid_shapes.append(grid_shape)
data_inputs = torch.cat(data_inputs, dim=0)

# Forward pass, then split back into per-image patch embeddings:
# (N1+...+Nn, P) --> [(N1, D), ..., (Nn, D)]
data_embeds = model(pixel_values=data_inputs, grid_shapes=grid_shapes)
data_embeds = data_embeds.split([int(np.prod(grid_shape)) for grid_shape in grid_shapes])
print(data_embeds[0].shape, data_embeds[1].shape)
```

## 📈 Evaluation

| Model | Size | #Seen | IN1K ZS | IN1K LP | Flickr T2I | Flickr I2T | K400 ZS | ADE20K |
|-------|------|-------|---------|---------|------------|------------|---------|--------|
| [UniViTAR-0.3B](https://huggingface.co/MM-MVR/UniViTAR-0.3B) | 310M | 14.6B | 81.5 | 87.7 | 84.0 | 95.1 | 66.0 | 54.6 |
| [UniViTAR-0.6B](https://huggingface.co/MM-MVR/UniViTAR-0.6B) | 637M | 14.6B | 82.3 | 88.3 | 84.1 | 95.5 | 68.6 | 55.1 |
| [UniViTAR-1B](https://huggingface.co/MM-MVR/UniViTAR-1B) | 1419M | 14.6B | 82.9 | 89.2 | 83.5 | 95.1 | 69.0 | 56.2 |

*ZS: Zero-shot Classification, LP: Linear-Probe Classification, T2I/I2T: Text-to-Image/Image-to-Text Retrieval; K400: Kinetics-400; ADE20K: Semantic Segmentation.*

## ✏️ Reference

If you find UniViTAR useful in your research or applications, please consider citing the following BibTeX:

```bibtex
@article{qiao2025univitar,
  title={UniViTAR: Unified Vision Transformer with Native Resolution},
  author={Qiao, Limeng and Gan, Yiyang and Wang, Bairui and Qin, Jie and Xu, Shuang and Yang, Siqi and Ma, Lin},
  journal={arXiv preprint arXiv:2504.01792},
  year={2025}
}
```
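
## 📎 Usage Notes

The forward pass in the usage example returns one embedding per visual patch, so images at different native resolutions yield different numbers of tokens. For image-level tasks such as retrieval or clustering, a single vector per image is usually more convenient. The minimal sketch below mean-pools each image's patch embeddings and L2-normalizes the result; mean pooling is an illustrative assumption here, not necessarily the aggregation used during UniViTAR pretraining.

```python
import torch

# Continuing from the usage example above: `data_embeds` is a tuple of
# per-image patch embeddings [(N1, D), ..., (Nn, D)] in bfloat16 on GPU.
# NOTE: mean pooling + L2 normalization are illustrative choices, not
# necessarily the aggregation used during UniViTAR pretraining.
image_features = torch.stack([patch_embeds.mean(dim=0) for patch_embeds in data_embeds])
image_features = torch.nn.functional.normalize(image_features.float(), dim=-1)  # (n, D)

# Cosine similarity between the two example images
similarity = image_features[0] @ image_features[1]
print(image_features.shape, similarity.item())
```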
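
The IN1K LP column in the evaluation table refers to linear probing, i.e., training a linear classifier on frozen backbone features. The sketch below illustrates that general protocol on top of UniViTAR features, reusing the `model` loaded in the usage example; the pooling, the `train_samples` iterable, and the hyperparameters are placeholders rather than the exact recipe used in the paper.

```python
import torch
import torch.nn as nn
from PIL import Image

def extract_feature(model, image: Image.Image) -> torch.Tensor:
    """Frozen UniViTAR feature for a single image (mean-pooled patches; illustrative)."""
    data_item = model.image_transform(image)
    input_data, grid_shape = model.data_patchify(data_item)
    with torch.no_grad():  # backbone stays frozen during linear probing
        embeds = model(pixel_values=input_data.to(torch.bfloat16).cuda(),
                       grid_shapes=[grid_shape])
    return embeds.float().mean(dim=0)  # (D,)

# Infer the feature dimension from one example image; 1000 classes for ImageNet-1K.
feature_dim = extract_feature(model, Image.open("xx1.jpg")).numel()
probe = nn.Linear(feature_dim, 1000).cuda()
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)  # placeholder hyperparameters
criterion = nn.CrossEntropyLoss()

# `train_samples` stands in for an ImageNet-1K training stream of (PIL image, int label) pairs.
for image, label in train_samples:
    feats = extract_feature(model, image).unsqueeze(0)  # (1, D), already on GPU
    logits = probe(feats)
    loss = criterion(logits, torch.tensor([label], device=feats.device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```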