Upload turing-motors/Heron-NVILA-Lite-2B
README.md CHANGED
@@ -113,16 +113,19 @@ print("---" * 40)
| Stage3 | Vision Encoder, Projector, LLM | [llava-instruct-v1_5-en-subset-358k](https://huggingface.co/datasets/llm-jp/llava-instruct-v1_5-en-subset-358k), [llava-instruct-ja](https://huggingface.co/datasets/llm-jp/llava-instruct-ja), [japanese-photos-conv](https://huggingface.co/datasets/llm-jp/japanese-photos-conversation), [ja-vg-vqa](https://huggingface.co/datasets/llm-jp/ja-vg-vqa-conversation), [synthdog-ja (subset)](https://huggingface.co/datasets/naver-clova-ix/synthdog-ja), [ai2d](https://huggingface.co/datasets/lmms-lab/ai2d), [synthdog-en](https://huggingface.co/datasets/naver-clova-ix/synthdog-en), [sherlock](https://github.com/allenai/sherlock) | 1.4M |

## Evaluation
-
+
+I used [llm-jp-eval-mm](https://github.com/llm-jp/llm-jp-eval-mm) for this evaluation. Scores for models other than Heron NVILA-Lite and Sarashina2-Vision-14B were taken from the [llm-jp-eval-mm leaderboard](https://llm-jp.github.io/llm-jp-eval-mm/) and the [Asagi website](https://uehara-mech.github.io/asagi-vlm?v=1). Heron NVILA-Lite and Sarashina2-Vision-14B were evaluated with llm-as-a-judge using "gpt-4o-2024-05-13"; Sarashina2-Vision-14B was evaluated on the [official blog](https://www.sbintuitions.co.jp/blog/entry/2025/03/17/111703) using "gpt-4o-2024-08-06". Due to these differences in evaluation conditions, the Sarashina2-Vision-14B results should be considered reference only.

| Model | LLM Size | Heron-Bench overall LLM (%) | JA-VLM-Bench-In-the-Wild LLM (/5.0) | JA-VG-VQA-500 LLM (/5.0) |
|--------------------------------|----------|------------------------------|-------------------------------------|--------------------------|
+| **[Heron NVILA-Lite 1B](https://huggingface.co/turing-motors/Heron-NVILA-Lite-1B)** | 0.5B | 45.9 | 2.92 | 3.16 |
| **Heron NVILA-Lite 2B** | 1.5B | 52.8 | 3.52 | 3.50 |
| **[Heron NVILA-Lite 15B](https://huggingface.co/turing-motors/Heron-NVILA-Lite-15B)** | 14B | 59.6 | 4.2 | 3.82 |
| [LLaVA-CALM2-SigLIP](https://huggingface.co/cyberagent/llava-calm2-siglip) | 7B | 43.3 | 3.15 | 3.21 |
| [Llama-3-EvoVLM-JP-v2](https://huggingface.co/SakanaAI/Llama-3-EvoVLM-JP-v2) | 8B | 39.3 | 2.92 | 2.96 |
| [VILA-jp](https://huggingface.co/llm-jp/llm-jp-3-vila-14b) | 13B | 57.2 | 3.69 | 3.62 |
| [Asagi-14B](https://huggingface.co/MIL-UT/Asagi-14B) | 13B | 55.8 | 3.44 | 3.84 |
+| [Sarashina2-Vision-14B](https://huggingface.co/sbintuitions/sarashina2-vision-14b) | 13B | 50.9 | 4.1 | 3.43 |
| [Qwen2-VL 7B Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) | 7B | 55.5 | 3.61 | 3.6 |
| GPT-4o | - | 87.6 | 3.85 | 3.58 |
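The note added above says the Heron NVILA-Lite rows were scored with llm-as-a-judge using "gpt-4o-2024-05-13". As a reference, here is a minimal sketch of that pattern, assuming the `openai` Python package; the `judge_score` helper and its prompt wording are hypothetical illustrations, and llm-jp-eval-mm's actual prompts and aggregation may differ.

```python
# Minimal llm-as-a-judge sketch (an assumption, not the exact llm-jp-eval-mm prompt).
# Rates one model answer on a 1-5 scale with "gpt-4o-2024-05-13" as the judge.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def judge_score(question: str, reference: str, prediction: str) -> float:
    """Ask the judge model to rate the prediction against the reference (1-5)."""
    prompt = (
        "You are rating an answer from a Japanese vision-language model.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Rate the model answer from 1 (poor) to 5 (perfect). Reply with the number only."
    )
    response = client.chat.completions.create(
        model="gpt-4o-2024-05-13",  # judge used for the Heron NVILA-Lite rows above
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    # The prompt asks for a bare number; real evaluation code would validate this.
    return float(response.choices[0].message.content.strip())
```

Averaging such per-example scores would give the /5.0 columns above, while Heron-Bench reports an overall percentage; see llm-jp-eval-mm for each benchmark's exact aggregation.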
@@ -151,8 +154,12 @@ This model is based on the results obtained in the project, subsidized by the [G
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.04468},
}
-```

-
-
-
+@inproceedings{maeda2025llm-jp-eval-mm,
+    author = {前田 航希 and 杉浦 一瑳 and 小田 悠介 and 栗田 修平 and 岡崎 直観},
+    month = mar,
+    series = {言語処理学会第31回年次大会 (NLP2025)},
+    title = {{llm-jp-eval-mm: 日本語視覚言語モデルの自動評価基盤}},
+    year = {2025}
+}
+```
|