---
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct/blob/main/LICENSE
language:
- ja
- en
tags:
- vila
- nvila
- conversational
- multimodal
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
- Efficient-Large-Model/paligemma-siglip-so400m-patch14-448
pipeline_tag: image-text-to-text
---
# Heron-NVILA-Lite-2B

Heron-NVILA-Lite-2B is a vision language model trained for Japanese, based on the [NVILA](https://arxiv.org/abs/2412.04468)-Lite architecture.

## Model Overview

* **Developer**: [Turing Inc.](https://www.turing-motors.com/)
* **Vision Encoder**: [paligemma-siglip-so400m-patch14-448](https://huggingface.co/Efficient-Large-Model/paligemma-siglip-so400m-patch14-448)
* **Projector**: mlp_downsample_2x2_fix
* **LLM**: [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct)
* **Supported Languages**: Japanese, English
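
As a rough guide to the input budget, the component names above suggest how many tokens each image occupies in the LLM context. The back-of-the-envelope sketch below is inferred only from those names (448-pixel input, 14-pixel patches, a 2x2 downsampling projector); treat the result as an approximation rather than a value read from the model code.

```python
# Approximate visual token count per 448x448 image, inferred from the component
# names above (not extracted from the model implementation).
image_size = 448                              # paligemma-siglip-so400m-patch14-448
patch_size = 14
patches_per_side = image_size // patch_size   # 32
encoder_tokens = patches_per_side ** 2        # 1024 patch tokens from the vision encoder
downsample = 2                                # mlp_downsample_2x2_fix merges 2x2 neighboring patches
llm_tokens_per_image = encoder_tokens // downsample ** 2
print(llm_tokens_per_image)                   # 256
```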

## Setup

```bash
# Confirmed to work with transformers 4.45.0, 4.46.0, and 4.49.0; other versions may work but are untested.
pip install transformers==4.45.0 accelerate opencv-python torchvision einops pillow
pip install git+https://github.com/bfshi/scaling_on_scales.git
```
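
Optionally, you can sanity-check the environment before loading the model; the snippet below uses only standard imports and nothing model-specific.

```python
# Optional sanity check: confirm the installed versions before loading the model.
import transformers
import torch

print(transformers.__version__)   # 4.45.0 here; 4.46.0 and 4.49.0 are also confirmed to work
print(torch.cuda.is_available())  # True if a CUDA GPU is available; recommended for the examples below
```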

## Usage

```python
from transformers import AutoConfig, AutoModel

model_path = "turing-motors/Heron-NVILA-Lite-2B"

# You can build the model from its config ...
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_config(config, trust_remote_code=True, device_map="auto")

# ... or load it directly with from_pretrained
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map="auto")

# Show the tokenizer's chat template
print(model.tokenizer.chat_template)

# Example: generate from plain text ("こんにちは" means "Hello")
response = model.generate_content(["ใ“ใ‚“ใซใกใฏ"])
print(response)
print("---" * 40)

# Example: generate from text + image ("画像を説明してください。" means "Please describe the image.")
from PIL import Image
import requests
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
response = model.generate_content([image, "็”ปๅƒใ‚’่ชฌๆ˜Žใ—ใฆใใ ใ•ใ„ใ€‚"])
print(response)
print("---" * 40)

# Example: generate with a custom GenerationConfig
from PIL import Image
import requests
from transformers import GenerationConfig
generation_config = {
    "max_new_tokens": 512,
    "temperature": 0.5,
    "do_sample": True,
}
generation_config = GenerationConfig(**generation_config)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
response = model.generate_content(
    [image, "็”ปๅƒใ‚’่ชฌๆ˜Žใ—ใฆใใ ใ•ใ„ใ€‚"],
    generation_config=generation_config
)
print(response)
print("---" * 40)

# Example: generate with interleaved inputs (text + image + text + image + text)
from PIL import Image
import requests
url_list = [
    "https://images.unsplash.com/photo-1694831404826-3400c48c188d?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D",
    "https://images.unsplash.com/photo-1693240876439-473af88b4ed7?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
]
images = [
   Image.open(requests.get(url, stream=True).raw).convert("RGB") for url in url_list
]
response = model.generate_content([
    images[0],
    "これは日本の画像です",  # "This is an image of Japan"
    images[1],
    "これはオーストリアの画像です",  # "This is an image of Austria"
    "各画像の違いを説明して"])  # "Explain the differences between the images"
print(response)
print("---" * 40)
```
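
The `generate_content` interface shown above also accepts local images. The following is a minimal sketch, assuming the model has already been loaded as above and that a file named `sample.jpg` exists in the working directory (the file name is just a placeholder).

```python
from PIL import Image

# Load a local image instead of fetching one over HTTP (placeholder path).
image = Image.open("sample.jpg").convert("RGB")

# Prompt: "What is shown in this image? Please answer concisely."
response = model.generate_content([image, "この画像には何が写っていますか?簡潔に答えてください。"])
print(response)
```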

## Training Summary

| Stage  | Training                      | Data Sources                  | Samples     |
|--------|-------------------------------|-------------------------------|-------------|
| Stage1 | Projector                     | [Japanese image text pairs](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-japanese-image-text-pairs), [LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain)                          | 1.1M      |
| Stage2 | Projector, LLM                | Filtered [MOMIJI](https://huggingface.co/datasets/turing-motors/MOMIJI) (CC-MAIN-2024-46, CC-MAIN-2024-51, CC-MAIN-2025-05)  | 13M     |
|        |                               | [Japanese image text pairs (subset)](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-japanese-image-text-pairs), [Japanese interleaved data (subset)](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-japanese-interleaved-data), [mmc4-core (subset)](https://github.com/allenai/mmc4), [coyo-700m (subset)](https://huggingface.co/datasets/kakaobrain/coyo-700m), [wikipedia_ja](https://huggingface.co/datasets/turing-motors/Wikipedia-Vision-JA), [llava_pretrain_ja](https://huggingface.co/datasets/turing-motors/LLaVA-Pretrain-JA), [stair_captions](http://captions.stair.center/)  | 20M     |
| Stage3 | Vision Encoder, Projector, LLM | [llava-instruct-v1_5-en-subset-358k](https://huggingface.co/datasets/llm-jp/llava-instruct-v1_5-en-subset-358k), [llava-instruct-ja](https://huggingface.co/datasets/llm-jp/llava-instruct-ja), [japanese-photos-conv](https://huggingface.co/datasets/llm-jp/japanese-photos-conversation), [ja-vg-vqa](https://huggingface.co/datasets/llm-jp/ja-vg-vqa-conversation), [synthdog-ja (subset)](https://huggingface.co/datasets/naver-clova-ix/synthdog-ja), [ai2d](https://huggingface.co/datasets/lmms-lab/ai2d), [synthdog-en](https://huggingface.co/datasets/naver-clova-ix/synthdog-en), [sherlock](https://github.com/allenai/sherlock)    | 1.1M      |

## Evaluation

I used [llm-jp-eval-mm](https://github.com/llm-jp/llm-jp-eval-mm) for this evaluation. Scores for models other than Heron-NVILA-Lite and Sarashina2-Vision-14B were taken from the [llm-jp-eval-mm leaderboard](https://llm-jp.github.io/llm-jp-eval-mm/) (as of March 2025) and the [Asagi website](https://uehara-mech.github.io/asagi-vlm?v=1). Heron-NVILA-Lite and Sarashina2-Vision-14B were evaluated here with "gpt-4o-2024-05-13" as the LLM-as-a-judge, whereas the [official blog](https://www.sbintuitions.co.jp/blog/entry/2025/03/17/111703) evaluated Sarashina2-Vision-14B with "gpt-4o-2024-08-06"; because the evaluation conditions differ, the Sarashina2-Vision-14B results should be treated as reference only.

| Model                          | LLM Size | Heron-Bench overall LLM (%) | JA-VLM-Bench-In-the-Wild LLM (/5.0) | JA-VG-VQA-500 LLM (/5.0) |
|--------------------------------|----------|------------------------------|-------------------------------------|--------------------------|
| **[Heron-NVILA-Lite-1B](https://huggingface.co/turing-motors/Heron-NVILA-Lite-1B)**        | 0.5B     | 45.9                         | 2.92                                | 3.16                     |
| **Heron-NVILA-Lite-2B**        | 1.5B     | 52.8                         | 3.52                                | 3.50                     |
| **[Heron-NVILA-Lite-15B](https://huggingface.co/turing-motors/Heron-NVILA-Lite-15B)**       | 14B      | 59.6                         | 4.2                                 | 3.82                     |
| [LLaVA-CALM2-SigLIP](https://huggingface.co/cyberagent/llava-calm2-siglip)             | 7B      | 43.3                        | 3.15                                | 3.21                     |
| [Llama-3-EvoVLM-JP-v2](https://huggingface.co/SakanaAI/Llama-3-EvoVLM-JP-v2)           | 8B      | 39.3                        | 2.92                                | 2.96                     |
| [VILA-jp](https://huggingface.co/llm-jp/llm-jp-3-vila-14b)                        | 13B     | 57.2                        | 3.69                                | 3.62                     |
| [Asagi-14B](https://huggingface.co/MIL-UT/Asagi-14B)                      | 13B     | 55.8                        | 3.44                                | 3.84                     |
| [Sarashina2-Vision-14B](https://huggingface.co/sbintuitions/sarashina2-vision-14b)                      | 13B     | 50.9                        | 4.1                                | 3.43                     |
| [Qwen2-VL 7B Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)                         | 7B       | 55.5                        | 3.61                                | 3.6                     |
| GPT-4o                         | -       | 87.6                        | 3.85                                | 3.58                     |

## Risks and Limitations

This model is experimental and has not been thoroughly calibrated for ethical compliance or legal standards. Caution is advised for sensitive applications.

## License

- Model weights are licensed under [Apache License 2.0](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct/blob/main/LICENSE).
- Users must comply with the [OpenAI terms of use](https://openai.com/policies/terms-of-use) because the training data includes GPT-4-generated synthetic data.

## Acknowledgements

This model is based on results obtained from a project, JPNP20017, subsidized by the New Energy and Industrial Technology Development Organization (NEDO).

I would like to acknowledge the use of the following open-source repositories:

- [VILA](https://github.com/NVlabs/VILA)
- [llm-jp-eval-mm](https://github.com/llm-jp/llm-jp-eval-mm)