Upload turing-motors/Heron-NVILA-Lite-2B
README.md CHANGED
@@ -113,16 +113,19 @@ print("---" * 40)
| Stage3 | Vision Encoder, Projector, LLM | [llava-instruct-v1_5-en-subset-358k](https://huggingface.co/datasets/llm-jp/llava-instruct-v1_5-en-subset-358k), [llava-instruct-ja](https://huggingface.co/datasets/llm-jp/llava-instruct-ja), [japanese-photos-conv](https://huggingface.co/datasets/llm-jp/japanese-photos-conversation), [ja-vg-vqa](https://huggingface.co/datasets/llm-jp/ja-vg-vqa-conversation), [synthdog-ja (subset)](https://huggingface.co/datasets/naver-clova-ix/synthdog-ja), [ai2d](https://huggingface.co/datasets/lmms-lab/ai2d), [synthdog-en](https://huggingface.co/datasets/naver-clova-ix/synthdog-en), [sherlock](https://github.com/allenai/sherlock) | 1.4M |

## Evaluation
-
+
+I used [llm-jp-eval-mm](https://github.com/llm-jp/llm-jp-eval-mm) for this evaluation. Scores for models other than Heron NVILA-Lite and Sarashina2-Vision-14B were taken from the [llm-jp-eval-mm leaderboard](https://llm-jp.github.io/llm-jp-eval-mm/) and the [Asagi website](https://uehara-mech.github.io/asagi-vlm?v=1). Heron NVILA-Lite and Sarashina2-Vision-14B were evaluated with llm-as-a-judge using "gpt-4o-2024-05-13"; Sarashina2-Vision-14B was evaluated on the [official blog](https://www.sbintuitions.co.jp/blog/entry/2025/03/17/111703) using "gpt-4o-2024-08-06". Due to these differences in evaluation conditions, the Sarashina2-Vision-14B results should be considered reference only.

| Model | LLM Size | Heron-Bench overall LLM (%) | JA-VLM-Bench-In-the-Wild LLM (/5.0) | JA-VG-VQA-500 LLM (/5.0) |
|--------------------------------|----------|------------------------------|-------------------------------------|--------------------------|
+| **[Heron NVILA-Lite 1B](https://huggingface.co/turing-motors/Heron-NVILA-Lite-1B)** | 0.5B | 45.9 | 2.92 | 3.16 |
| **Heron NVILA-Lite 2B** | 1.5B | 52.8 | 3.52 | 3.50 |
| **[Heron NVILA-Lite 15B](https://huggingface.co/turing-motors/Heron-NVILA-Lite-15B)** | 14B | 59.6 | 4.2 | 3.82 |
| [LLaVA-CALM2-SigLIP](https://huggingface.co/cyberagent/llava-calm2-siglip) | 7B | 43.3 | 3.15 | 3.21 |
| [Llama-3-EvoVLM-JP-v2](https://huggingface.co/SakanaAI/Llama-3-EvoVLM-JP-v2) | 8B | 39.3 | 2.92 | 2.96 |
| [VILA-jp](https://huggingface.co/llm-jp/llm-jp-3-vila-14b) | 13B | 57.2 | 3.69 | 3.62 |
| [Asagi-14B](https://huggingface.co/MIL-UT/Asagi-14B) | 13B | 55.8 | 3.44 | 3.84 |
+| [Sarashina2-Vision-14B](https://huggingface.co/sbintuitions/sarashina2-vision-14b) | 13B | 50.9 | 4.1 | 3.43 |
| [Qwen2-VL 7B Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) | 7B | 55.5 | 3.61 | 3.6 |
| GPT-4o | - | 87.6 | 3.85 | 3.58 |
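The note added above says the Heron NVILA-Lite rows were scored with llm-as-a-judge using "gpt-4o-2024-05-13". As a reference, here is a minimal sketch of that pattern, assuming the `openai` Python package; the `judge_score` helper and its prompt wording are hypothetical illustrations, and llm-jp-eval-mm's actual prompts and aggregation may differ.

```python
# Minimal llm-as-a-judge sketch (an assumption, not the exact llm-jp-eval-mm prompt).
# Rates one model answer on a 1-5 scale with "gpt-4o-2024-05-13" as the judge.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def judge_score(question: str, reference: str, prediction: str) -> float:
    """Ask the judge model to rate the prediction against the reference (1-5)."""
    prompt = (
        "You are rating an answer from a Japanese vision-language model.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Rate the model answer from 1 (poor) to 5 (perfect). Reply with the number only."
    )
    response = client.chat.completions.create(
        model="gpt-4o-2024-05-13",  # judge used for the Heron NVILA-Lite rows above
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    # The prompt asks for a bare number; real evaluation code would validate this.
    return float(response.choices[0].message.content.strip())
```

Averaging such per-example scores would give the /5.0 columns above, while Heron-Bench reports an overall percentage; see llm-jp-eval-mm for each benchmark's exact aggregation.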
@@ -151,8 +154,12 @@ This model is based on the results obtained in the project, subsidized by the [G
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.04468},
}
-```

-
-
-
+@inproceedings{maeda2025llm-jp-eval-mm,
+    author = {前田 航希 and 杉浦 一瑳 and 小田 悠介 and 栗田 修平 and 岡崎 直観},
+    month = mar,
+    series = {言語処理学会第31回年次大会 (NLP2025)},
+    title = {{llm-jp-eval-mm: 日本語視覚言語モデルの自動評価基盤}},
+    year = {2025}
+}
+```
|