ByteDance/Sa2VA-1B

@@ -1,15 +1,9 @@
 ---
-license: mit
 pipeline_tag: image-text-to-text
 library_name: transformers
 base_model:
-  - OpenGVLab/InternVL2-1B
-  - OpenGVLab/InternVL2_5-8B
-  - OpenGVLab/InternVL2_5-4B
-  - OpenGVLab/InternViT-300M-448px-V2_5
-  - internlm/internlm2_5-7b-chat
-  - Qwen/Qwen2-0.5B-Instruct
-  - Qwen/Qwen2.5-3B-Instruct
 base_model_relation: merge
 language:
   - multilingual
@@ -34,18 +28,20 @@ Sa2VA is an MLLM capable of question answering, visual prompt understanding, and
 We built the Sa2VA series based on Qwen2-VL and InternVL2/2.5. In the following table, we provide some Sa2VA models built on InternVL2.5. Other Sa2VA models will be open-sourced soon.
-| Model Name |                             Base MLLM                             |                                Language Part                                |                       HF Link                        |
-|:----------:|:-----------------------------------------------------------------:|:---------------------------------------------------------------------------:|:----------------------------------------------------:|
-|  Sa2VA-1B  | [InternVL2.0-1B](https://huggingface.co/OpenGVLab/InternVL2-1B) |   [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct)    | [🤗 link](https://huggingface.co/ByteDance/Sa2VA-1B) |
-|  Sa2VA-4B  | [InternVL2.5-4B](https://huggingface.co/OpenGVLab/InternVL2_5-4B) |   [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct)    | [🤗 link](https://huggingface.co/ByteDance/Sa2VA-4B) |
-|  Sa2VA-8B  | [InternVL2.5-8B](https://huggingface.co/OpenGVLab/InternVL2_5-8B) | [internlm2_5-7b-chat](https://huggingface.co/internlm/internlm2_5-7b-chat)  | [🤗 link](https://huggingface.co/ByteDance/Sa2VA-8B) |
 ## Sa2VA Performance
-| Model Name |                             MMBench                             |                                    MME                                     |                       RefCOCO                        | RefCOCO+ | RefCOCOg | MeVIS | DAVIS | ReVOS |
-|:----------:|:---------------------------------------------------------------:|:--------------------------------------------------------------------------:|:----------------------------------------------------:|:----------------------------------------------------:|:----------------------------------------------------:|:----------------------------------------------------:|:----------------------------------------------------:|:-----:|
-|  Sa2VA-1B  |                            1381/405                             | 68.3 | 77.4 | 69.9 | 72.3 | 50.8 | 72.3 | 47.6 |
-|  Sa2VA-4B  |                            1536/530                             | 77.3 | 78.9 | 71.7 | 74.1 | 52.1 | 73.8 | 53.2 |
-|  Sa2VA-8B  | 1617/511 | 81.6 | 81.6 | 76.2 | 78.7 | 57.0 | 75.2 | 57.6 |
 ## Quick Start
@@ -60,7 +56,7 @@ import numpy as np
 import os
 # load the model and tokenizer
-path = "ByteDance/Sa2VA-4B"
 model = AutoModel.from_pretrained(
     path,
     torch_dtype=torch.bfloat16,

 ---
+license: apache-2.0
 pipeline_tag: image-text-to-text
 library_name: transformers
 base_model:
+  - OpenGVLab/InternVL2_5-1B
 base_model_relation: merge
 language:
   - multilingual
 We built the Sa2VA series based on Qwen2-VL and InternVL2/2.5. In the following table, we provide some Sa2VA models built on InternVL2.5. Other Sa2VA models will be open-sourced soon.
+| Model Name |                             Base MLLM                              |                                Language Part                                |                        HF Link                        |
+|:----------:|:------------------------------------------------------------------:|:---------------------------------------------------------------------------:|:-----------------------------------------------------:|
+|  Sa2VA-1B  | [InternVL2.5-1B](https://huggingface.co/OpenGVLab/InternVL2_5-1B)  | [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)  | [🤗 link](https://huggingface.co/ByteDance/Sa2VA-1B)  |
+|  Sa2VA-4B  | [InternVL2.5-4B](https://huggingface.co/OpenGVLab/InternVL2_5-4B)  |   [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct)    | [🤗 link](https://huggingface.co/ByteDance/Sa2VA-4B)  |
+|  Sa2VA-8B  | [InternVL2.5-8B](https://huggingface.co/OpenGVLab/InternVL2_5-8B)  | [internlm2_5-7b-chat](https://huggingface.co/internlm/internlm2_5-7b-chat)  | [🤗 link](https://huggingface.co/ByteDance/Sa2VA-8B)  |
+| Sa2VA-26B  | [InternVL2.5-26B](https://huggingface.co/OpenGVLab/InternVL2_5-26B) | [internlm2_5-20b-chat](https://huggingface.co/internlm/internlm2_5-20b-chat) | [🤗 link](https://huggingface.co/ByteDance/Sa2VA-26B) |
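The "HF Link" column above gives the repository ID that Quick Start assigns to `path`. As a minimal sketch, assuming you want to switch between checkpoints by name, the mapping can be kept in one place. The `SA2VA_REPOS` dictionary and the `resolve_path` helper are hypothetical names introduced here for illustration; they are not part of the Sa2VA code.

```python
# Hub repository IDs copied from the "HF Link" column of the table above.
SA2VA_REPOS = {
    "Sa2VA-1B": "ByteDance/Sa2VA-1B",
    "Sa2VA-4B": "ByteDance/Sa2VA-4B",
    "Sa2VA-8B": "ByteDance/Sa2VA-8B",
    "Sa2VA-26B": "ByteDance/Sa2VA-26B",
}


def resolve_path(model_name: str) -> str:
    """Return the Hub repository ID for a Sa2VA model name (hypothetical helper)."""
    try:
        return SA2VA_REPOS[model_name]
    except KeyError:
        known = ", ".join(sorted(SA2VA_REPOS))
        raise ValueError(
            f"unknown model {model_name!r}; expected one of: {known}"
        ) from None


# The resulting string is what Quick Start passes as `path`.
path = resolve_path("Sa2VA-1B")
```

Keeping the names in a single dictionary means a typo in a model name fails early with a readable error, before any download starts.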
 ## Sa2VA Performance
+| Model Name |   MME    | MMBench | RefCOCO | RefCOCO+ | RefCOCOg | MeVIS (val_u) | DAVIS |
+|:----------:|:--------:|:-------:|:-------:|:--------:|:--------:|:-------------:|:-----:|
+|  Sa2VA-1B  | 1504/434 |  71.9   |  79.6   |   73.6   |   77.7   |     53.4      | 69.5  |
+|  Sa2VA-4B  | 1691/610 |  81.8   |  82.4   |   77.6   |   79.7   |     55.9      | 73.7  |
+|  Sa2VA-8B  | 1690/610 |  84.4   |  82.6   |   78.0   |   80.3   |     58.9      | 75.9  |
+| Sa2VA-26B  | 1698/653 |  85.8   |  82.9   |   79.3   |   81.2   |     61.8      | 78.6  |
 ## Quick Start
 import os
 # load the model and tokenizer
+path = "ByteDance/Sa2VA-1B"
 model = AutoModel.from_pretrained(
     path,
     torch_dtype=torch.bfloat16,