Pipeline: Image-Text-to-Text · Format: Safetensors · Architecture: qwen2 · Tags: conversational


ViCToR Model Card

Model details

Model: DeepGlint-AI/ViCToR-LLaVA-SigLIP2-Qwen2.5-7b

Model type: ViCToR is a large multimodal model in the LLaVA style, pairing a SigLIP2 vision encoder with a Qwen2.5-7B language model and pretrained with visual token reconstruction to improve visual comprehension (8.22B parameters in total, released as BF16 Safetensors).

Paper or resources for more information: https://github.com/deepglint/Victor

Where to send questions or comments about the model: https://github.com/deepglint/Victor/issues
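
For convenience, here is a minimal, hypothetical sketch of fetching the released checkpoint from the Hugging Face Hub. The exact inference API is not documented in this card, so the sketch assumes the downloaded weights are then used with the reference code in the GitHub repository above; it is an illustration, not an official loading recipe.

```python
# Minimal sketch (assumption: the weights are hosted under the repo id below and
# are consumed by the reference inference code at https://github.com/deepglint/Victor;
# this card does not document a transformers-native loading path).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="DeepGlint-AI/ViCToR-LLaVA-SigLIP2-Qwen2.5-7b")
print(f"Checkpoint (BF16 Safetensors, ~8.22B params) downloaded to: {local_dir}")
```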

Results

| Benchmark | ViCToR-7B | LLaVA-1.5-13B | LLaVA-NeXT-8B | Ross |
|---|---|---|---|---|
| MMStar | 54.3 | 34.3 | 43.9 | 53.9 |
| RealWorldQA | 65.6 | 55.3 | 58.4 | 58.7 |
| MMBench-CN (val) | 79.0 | 67.8 | – | – |
| OCRBench | 556 | 337 | 531 | 553 |
| POPE | 88.4 | 88.4 | 87.1 | 88.1 |
| MMMU | 48.9 | 37.0 | 43.1 | 49.0 |
| AI2D | 79.5 | 61.1 | 72.8 | 79.5 |
| MME | 2071 | 1781 | 1908 | 1854 |
| SEED-Bench | 75.7 | 68.2 | 72.5 | 73.6 |

Citation

@inproceedings{Xie2024ViCToRIV,
  title={ViCToR: Improving Visual Comprehension via Token Reconstruction for Pretraining LMMs},
  author={Yin Xie and Kaicheng Yang and Peirou Liang and Xiang An and Yongle Zhao and Yumeng Wang and Ziyong Feng and Roy Miles and Ismail Elezi and Jiankang Deng},
  year={2024},
  url={https://api.semanticscholar.org/CorpusID:273482504}
}