---
license: apache-2.0
datasets:
- mlfoundations/datacomp_1b
- kakaobrain/coyo-700m
- laion/laion400m
language:
- en
- zh
metrics:
- accuracy
- recall
pipeline_tag: feature-extraction
library_name: transformers
---

# UniViTAR: Unified Vision Transformer with Native Resolution

## 🌠 Introduction

We present **UniViTAR**, a family of homogeneous vision foundation models tailored **for unified visual modalities and native-resolution scenarios** in the multimodal era. We train the UniViTAR family at multiple model scales from **0.3B to 1.4B** parameters exclusively on publicly accessible image-caption data (14.6B samples seen), and observe that performance consistently improves with parameter scaling. UniViTAR is a Transformer-based encoder that inherits the overall architecture of the conventional Vision Transformer while incorporating the following modifications: *unified patchify for native image and video modalities, 2D RoPE, SwiGLU, RMSNorm, and QK-Norm*.

## 🛠️ Environment

```bash
conda create -n univitar python=3.11 -y
conda activate univitar
pip3 install einops==0.8.0 ninja==1.11.1.1 numpy==1.26.4 pillow==10.4.0 psutil==6.0.0 \
    torch==2.2.2 torchvision==0.17.2 transformers==4.49.0 timm==1.0.14
pip3 install flash-attn==2.6.3
```

## 🗝️ Model Usage

```python
import torch
import numpy as np
from PIL import Image
from modeling_univitar import UniViTARVisionModel

# Prepare the model
model = UniViTARVisionModel("config.json")
_ = model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
model = model.to(torch.bfloat16).cuda()

# Prepare the data: patchify each image at its native resolution,
# [(3, H1, W1), ..., (3, Hn, Wn)] --> (N1+...+Nn, P)
images = [Image.open("xx1.jpg"), Image.open("xx2.jpg")]
data_inputs, grid_shapes = [], []
for image in images:
    data_item = model.image_transform(image)
    input_data, grid_shape = model.data_patchify(data_item)
    data_inputs.append(input_data.to(torch.bfloat16).cuda())
    grid_shapes.append(grid_shape)
data_inputs = torch.cat(data_inputs, dim=0)

# Forward pass, then split back into per-image patch embeddings:
# (N1+...+Nn, P) --> [(N1, D), ..., (Nn, D)]
data_embeds = model(pixel_values=data_inputs, grid_shapes=grid_shapes)
data_embeds = data_embeds.split([int(np.prod(grid_shape)) for grid_shape in grid_shapes])
print(data_embeds[0].shape, data_embeds[1].shape)
```

## 📈 Evaluation

| Model | Size | #Seen | IN1K ZS | IN1K LP | Flickr T2I | Flickr I2T | K400 ZS | ADE20K |
|-------|------|-------|---------|---------|------------|------------|---------|--------|
| [UniViTAR-0.3B](https://huggingface.co/MM-MVR/UniViTAR-0.3B) | 310M | 14.6B | 81.5 | 87.7 | 84.0 | 95.1 | 66.0 | 54.6 |
| [UniViTAR-0.6B](https://huggingface.co/MM-MVR/UniViTAR-0.6B) | 637M | 14.6B | 82.3 | 88.3 | 84.1 | 95.5 | 68.6 | 55.1 |
| [UniViTAR-1B](https://huggingface.co/MM-MVR/UniViTAR-1B) | 1419M | 14.6B | 82.9 | 89.2 | 83.5 | 95.1 | 69.0 | 56.2 |

*ZS: Zero-shot Classification, LP: Linear-Probe Classification, T2I/I2T: Text-to-Image/Image-to-Text Retrieval; K400: Kinetics-400; ADE20K: Semantic Segmentation.*

## ✏️ Reference

If you find UniViTAR useful in your research or applications, please consider citing the following BibTeX:

```bibtex
@article{qiao2025univitar,
  title={UniViTAR: Unified Vision Transformer with Native Resolution},
  author={Qiao, Limeng and Gan, Yiyang and Wang, Bairui and Qin, Jie and Xu, Shuang and Yang, Siqi and Ma, Lin},
  journal={arXiv preprint arXiv:2504.01792},
  year={2025}
}
```
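
## 📎 Usage Notes

The forward pass in the usage example returns one embedding per visual patch, so images at different native resolutions yield different numbers of tokens. For image-level tasks such as retrieval or clustering, a single vector per image is usually more convenient. The minimal sketch below mean-pools each image's patch embeddings and L2-normalizes the result; mean pooling is an illustrative assumption here, not necessarily the aggregation used during UniViTAR pretraining.

```python
import torch

# Continuing from the usage example above: `data_embeds` is a tuple of
# per-image patch embeddings [(N1, D), ..., (Nn, D)] in bfloat16 on GPU.
# NOTE: mean pooling + L2 normalization are illustrative choices, not
# necessarily the aggregation used during UniViTAR pretraining.
image_features = torch.stack([patch_embeds.mean(dim=0) for patch_embeds in data_embeds])
image_features = torch.nn.functional.normalize(image_features.float(), dim=-1)  # (n, D)

# Cosine similarity between the two example images
similarity = image_features[0] @ image_features[1]
print(image_features.shape, similarity.item())
```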
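
The IN1K LP column in the evaluation table refers to linear probing, i.e., training a linear classifier on frozen backbone features. The sketch below illustrates that general protocol on top of UniViTAR features, reusing the `model` loaded in the usage example; the pooling, the `train_samples` iterable, and the hyperparameters are placeholders rather than the exact recipe used in the paper.

```python
import torch
import torch.nn as nn
from PIL import Image

def extract_feature(model, image: Image.Image) -> torch.Tensor:
    """Frozen UniViTAR feature for a single image (mean-pooled patches; illustrative)."""
    data_item = model.image_transform(image)
    input_data, grid_shape = model.data_patchify(data_item)
    with torch.no_grad():  # backbone stays frozen during linear probing
        embeds = model(pixel_values=input_data.to(torch.bfloat16).cuda(),
                       grid_shapes=[grid_shape])
    return embeds.float().mean(dim=0)  # (D,)

# Infer the feature dimension from one example image; 1000 classes for ImageNet-1K.
feature_dim = extract_feature(model, Image.open("xx1.jpg")).numel()
probe = nn.Linear(feature_dim, 1000).cuda()
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)  # placeholder hyperparameters
criterion = nn.CrossEntropyLoss()

# `train_samples` stands in for an ImageNet-1K training stream of (PIL image, int label) pairs.
for image, label in train_samples:
    feats = extract_feature(model, image).unsqueeze(0)  # (1, D), already on GPU
    logits = probe(feats)
    loss = criterion(logits, torch.tensor([label], device=feats.device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```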