---
license: apache-2.0
datasets:
- mlfoundations/datacomp_1b
- kakaobrain/coyo-700m
- laion/laion400m
language:
- en
- zh
metrics:
- accuracy
- recall
pipeline_tag: feature-extraction
library_name: transformers
---
# UniViTAR: Unified Vision Transformer with Native Resolution
## 🌠 Introduction
We present **UniViTAR**, a family of homogeneous vision foundation models tailored **for unified visual modalities and native-resolution inputs** in the multimodal era. We train the UniViTAR family at multiple model scales from **0.3B to 1.4B** parameters exclusively on publicly accessible image-caption data (14.6B samples), and observe that performance increases consistently with parameter scale. UniViTAR is a Transformer-based encoder that inherits the architecture of the conventional Vision Transformer while incorporating the following modifications: *unified patchify for native image and video modalities, 2D RoPE, SwiGLU, RMSNorm, and QK-Norm* (see the sketch below for how the normalization, feed-forward, and attention choices fit together).
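For readers unfamiliar with these components, the following is a minimal, illustrative sketch of how RMSNorm, SwiGLU, and QK-Norm are typically combined in a ViT block. Module names and the omission of 2D RoPE are simplifications for exposition; refer to `modeling_univitar.py` in this repository for the actual implementation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * self.weight

class SwiGLU(nn.Module):
    """Gated feed-forward: SiLU(x W_gate) * (x W_up), projected back by W_down."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)
        self.up = nn.Linear(dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

class QKNormAttention(nn.Module):
    """Multi-head self-attention with RMSNorm applied to queries and keys
    (QK-Norm) before the dot product; 2D RoPE is omitted here for brevity."""
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.q_norm = RMSNorm(self.head_dim)
        self.k_norm = RMSNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)
        out = F.scaled_dot_product_attention(q, k, v)
        return self.proj(out.transpose(1, 2).reshape(B, N, D))
```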
## 🛠️ Environment
```bash
conda create -n univitar python=3.11 -y
conda activate univitar
pip3 install einops==0.8.0 ninja==1.11.1.1 numpy==1.26.4 pillow==10.4.0 psutil==6.0.0 torch==2.2.2 torchvision==0.17.2 transformers==4.49.0 timm==1.0.14
pip3 install flash-attn==2.6.3
```
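After installation, the optional snippet below is a quick sanity check that the pinned packages import correctly and a CUDA device is visible; the printed versions are whatever your environment resolves.
```python
# Optional sanity check for the environment above.
import torch, transformers, timm

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__, "| timm:", timm.__version__)

try:
    import flash_attn  # used by the fused attention kernels
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed; check the pip install step above.")
```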
## 🗝️ Model Usage
```python
import torch
import numpy as np
from PIL import Image
from modeling_univitar import UniViTARVisionModel
# Prepare model: build from the config and load the released checkpoint
model = UniViTARVisionModel("config.json")
_ = model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
model = model.to(torch.bfloat16).cuda()

# Prepare data: images of native resolution [(3, H1, W1), ..., (3, Hn, Wn)]
# are patchified and packed into a single sequence of shape (N1+...+Nn, P)
images = [Image.open("xx1.jpg"), Image.open("xx2.jpg")]
data_inputs, grid_shapes = [], []
for image in images:
    data_item = model.image_transform(image)                  # preprocess PIL image to a (3, H, W) tensor
    input_data, grid_shape = model.data_patchify(data_item)   # flatten into (Ni, P) patches plus its grid shape
    data_inputs.append(input_data.to(torch.bfloat16).cuda())
    grid_shapes.append(grid_shape)
data_inputs = torch.concatenate(data_inputs, dim=0)

# Forward: packed patches (N1+...+Nn, P) --> per-image patch embeddings [(N1, D), ..., (Nn, D)]
data_embeds = model(pixel_values=data_inputs, grid_shapes=grid_shapes)
data_embeds = data_embeds.split([np.prod(grid_shape) for grid_shape in grid_shapes])
print(data_embeds[0].shape, data_embeds[1].shape)
```
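The forward pass returns one embedding per patch. If a single vector per image is needed (e.g., for retrieval or probing), one simple option is to mean-pool the patch embeddings. The pooling and normalization below are illustrative assumptions that continue the snippet above, not necessarily the protocol behind the numbers in the next section.
```python
# Illustrative: collapse per-patch embeddings into one vector per image by mean pooling.
image_embeds = [embeds.float().mean(dim=0) for embeds in data_embeds]   # list of (D,)
image_embeds = torch.stack(image_embeds)                                # (num_images, D)
image_embeds = torch.nn.functional.normalize(image_embeds, dim=-1)      # unit-norm for cosine similarity
print(image_embeds.shape)
```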
## 📈 Evaluation
| Model | Size | \#Seen | IN1K ZS | IN1K LP | Flickr T2I | Flickr I2T | K400 ZS | ADE20K |
|--------|-----|----|------|------|------|------|------|------|
| [UniViTAR-0.3B](https://huggingface.co/MM-MVR/UniViTAR-0.3B) | 310M | 14.6B | 81.5 | 87.7 | 84.0 | 95.1 | 66.0 | 54.6 |
| [UniViTAR-0.6B](https://huggingface.co/MM-MVR/UniViTAR-0.6B) | 637M | 14.6B | 82.3 | 88.3 | 84.1 | 95.5 | 68.6 | 55.1 |
| [UniViTAR-1B](https://huggingface.co/MM-MVR/UniViTAR-1B) | 1419M | 14.6B | 82.9 | 89.2 | 83.5 | 95.1 | 69.0 | 56.2 |
*IN1K: ImageNet-1K, K400: Kinetics-400, ZS: zero-shot classification, LP: linear-probe classification, T2I/I2T: text-to-image / image-to-text retrieval, ADE20K: semantic segmentation*
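As context for the LP column, linear probing trains only a linear classifier on frozen backbone features. The sketch below assumes mean-pooled patch embeddings (as in the usage section) and placeholder dimensions; the actual evaluation protocol (pooling, optimizer, schedule) may differ.
```python
import torch
import torch.nn as nn

# Minimal linear-probe sketch: the backbone stays frozen and only a linear
# classifier is trained on pooled features. feature_dim and num_classes are
# placeholders for illustration.
feature_dim, num_classes = 1024, 1000
probe = nn.Linear(feature_dim, num_classes).cuda()
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def probe_step(pooled_features: torch.Tensor, labels: torch.Tensor) -> float:
    """One training step on pre-extracted, frozen features of shape (B, feature_dim)."""
    logits = probe(pooled_features.float().cuda())
    loss = criterion(logits, labels.cuda())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```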
## ✏️ Reference
If you find UniViTAR useful in your research or applications, please consider citing the following BibTeX:
```bibtex
@article{qiao2025univitar,
  title={UniViTAR: Unified Vision Transformer with Native Resolution},
  author={Qiao, Limeng and Gan, Yiyang and Wang, Bairui and Qin, Jie and Xu, Shuang and Yang, Siqi and Ma, Lin},
  journal={arXiv preprint arXiv:2504.01792},
  year={2025}
}
```