---
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct/blob/main/LICENSE
language:
- ja
- en
tags:
- vila
- nvila
- conversational
- multimodal
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
- Efficient-Large-Model/paligemma-siglip-so400m-patch14-448
pipeline_tag: image-text-to-text
---
# Heron-NVILA-Lite-2B
Heron-NVILA-Lite-2B is a vision language model trained for Japanese, based on the [NVILA](https://arxiv.org/abs/2412.04468)-Lite architecture.
## Model Overview
* **Developer**: [Turing Inc.](https://www.turing-motors.com/)
* **Vision Encoder**: [paligemma-siglip-so400m-patch14-448](https://huggingface.co/Efficient-Large-Model/paligemma-siglip-so400m-patch14-448)
* **Projector**: mlp_downsample_2x2_fix (see the illustrative sketch after this list)
* **LLM**: [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct)
* **Supported Languages**: Japanese, English
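
The projector name describes its behavior: each 2×2 neighborhood of vision-encoder tokens is merged, reducing the token count 4×, and the concatenated features are projected into the LLM embedding space by an MLP. Below is a minimal illustrative sketch of that idea in PyTorch; the layer order, hidden sizes, and the 1152/1536 dimensions are placeholders, not the exact NVILA-Lite implementation.

```python
import torch
import torch.nn as nn

class DownsampleMLPProjector(nn.Module):
    """Illustrative 2x2 token-merging MLP projector (placeholder dimensions)."""

    def __init__(self, vision_dim=1152, llm_dim=1536):
        super().__init__()
        # After merging each 2x2 group of tokens, the feature dimension grows 4x.
        self.proj = nn.Sequential(
            nn.LayerNorm(vision_dim * 4),
            nn.Linear(vision_dim * 4, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x):
        # x: (batch, h * w, vision_dim) vision-encoder tokens, with h and w even.
        b, n, c = x.shape
        h = w = int(n ** 0.5)
        x = x.view(b, h // 2, 2, w // 2, 2, c)  # split the token grid into 2x2 blocks
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // 2) * (w // 2), 4 * c)
        return self.proj(x)

tokens = torch.randn(1, 32 * 32, 1152)          # e.g. a 32x32 grid of patch tokens
print(DownsampleMLPProjector()(tokens).shape)   # torch.Size([1, 256, 1536])
```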
## Setup
```bash
# Tested with transformers 4.45.0; versions 4.46.0 and 4.49.0 have also been confirmed to work. Other versions may work but are untested.
pip install transformers==4.45.0 accelerate opencv-python torchvision einops pillow
pip install git+https://github.com/bfshi/scaling_on_scales.git
```
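
As an optional sanity check (this snippet is my own, not part of the model's requirements), you can warn if the installed `transformers` version is not one of the versions noted above:

```python
# Warn if the installed transformers version differs from the ones noted above.
import transformers

tested = {"4.45.0", "4.46.0", "4.49.0"}
if transformers.__version__ not in tested:
    print(f"Warning: transformers {transformers.__version__} is untested with this card's examples.")
```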
## Usage
```python
from transformers import AutoConfig, AutoModel
model_path = "turing-motors/Heron-NVILA-Lite-2B"
# Option 1: instantiate the model from its config
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_config(config, trust_remote_code=True, device_map="auto")
# Option 2: load the weights directly with from_pretrained
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map="auto")
# show chat_template
print(model.tokenizer.chat_template)
# Example: generate from raw text
response = model.generate_content(["こんにちは"])  # "Hello"
print(response)
print("---" * 40)
# Example: generate with text + image
from PIL import Image
import requests
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
response = model.generate_content([image, "画像を説明してください。"])  # "Describe the image."
print(response)
print("---" * 40)
# Example: generate with an explicit GenerationConfig
from PIL import Image
import requests
from transformers import GenerationConfig
generation_config = {
    "max_new_tokens": 512,
    "temperature": 0.5,
    "do_sample": True,
}
generation_config = GenerationConfig(**generation_config)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
response = model.generate_content(
    [image, "画像を説明してください。"],
    generation_config=generation_config,
)
print(response)
print("---" * 40)
# Example: interleaved generation with text + image + text + image + text
from PIL import Image
import requests
url_list = [
    "https://images.unsplash.com/photo-1694831404826-3400c48c188d?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D",
    "https://images.unsplash.com/photo-1693240876439-473af88b4ed7?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
]
images = [
    Image.open(requests.get(url, stream=True).raw).convert("RGB") for url in url_list
]
response = model.generate_content([
    images[0],
    "これは日本の画像です",  # "This is an image of Japan"
    images[1],
    "これはオーストリアの画像です",  # "This is an image of Austria"
    "各画像の違いを説明して",  # "Explain the differences between the images"
])
print(response)
print("---" * 40)
```
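
The calls above can be wrapped in a small convenience helper. The function below is my own wrapper (its name and defaults are not part of the model's API); it only combines `generate_content` and `GenerationConfig` exactly as demonstrated above.

```python
import requests
from PIL import Image
from transformers import GenerationConfig

def describe_images(model, urls, prompt, max_new_tokens=256):
    """Download images, place them before the prompt, and generate one response."""
    images = [
        Image.open(requests.get(u, stream=True).raw).convert("RGB") for u in urls
    ]
    config = GenerationConfig(max_new_tokens=max_new_tokens, do_sample=False)
    return model.generate_content([*images, prompt], generation_config=config)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
print(describe_images(model, [url], "画像を説明してください。"))  # "Describe the image."
```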
## Training Summary
| Stage | Training | Data Sources | Samples |
|--------|-------------------------------|-------------------------------|-------------|
| Stage1 | Projector | [Japanese image text pairs](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-japanese-image-text-pairs), [LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) | 1.1M |
| Stage2 | Projector, LLM | Filtered [MOMIJI](https://huggingface.co/datasets/turing-motors/MOMIJI) (CC-MAIN-2024-46, CC-MAIN-2024-51, CC-MAIN-2025-05) | 13M |
| | | [Japanese image text pairs (subset)](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-japanese-image-text-pairs), [Japanese interleaved data (subset)](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-japanese-interleaved-data), [mmc4-core (subset)](https://github.com/allenai/mmc4), [coyo-700m (subset)](https://huggingface.co/datasets/kakaobrain/coyo-700m), [wikipedia_ja](https://huggingface.co/datasets/turing-motors/Wikipedia-Vision-JA), [llava_pretrain_ja](https://huggingface.co/datasets/turing-motors/LLaVA-Pretrain-JA), [stair_captions](http://captions.stair.center/) | 20M |
| Stage3 | Vision Encoder, Projector, LLM | [llava-instruct-v1_5-en-subset-358k](https://huggingface.co/datasets/llm-jp/llava-instruct-v1_5-en-subset-358k), [llava-instruct-ja](https://huggingface.co/datasets/llm-jp/llava-instruct-ja), [japanese-photos-conv](https://huggingface.co/datasets/llm-jp/japanese-photos-conversation), [ja-vg-vqa](https://huggingface.co/datasets/llm-jp/ja-vg-vqa-conversation), [synthdog-ja (subset)](https://huggingface.co/datasets/naver-clova-ix/synthdog-ja), [ai2d](https://huggingface.co/datasets/lmms-lab/ai2d), [synthdog-en](https://huggingface.co/datasets/naver-clova-ix/synthdog-en), [sherlock](https://github.com/allenai/sherlock) | 1.1M |
## Evaluation
I used [llm-jp-eval-mm](https://github.com/llm-jp/llm-jp-eval-mm) for this evaluation. Scores for models other than Heron-NVILA-Lite and Sarashina2-Vision-14B are taken from the [llm-jp-eval-mm leaderboard](https://llm-jp.github.io/llm-jp-eval-mm/) (as of March 2025) and the [Asagi website](https://uehara-mech.github.io/asagi-vlm?v=1). Heron-NVILA-Lite and Sarashina2-Vision-14B were evaluated with an LLM-as-a-judge using "gpt-4o-2024-05-13" (a simplified sketch of this protocol appears after the table). Sarashina2-Vision-14B was evaluated on its [official blog](https://www.sbintuitions.co.jp/blog/entry/2025/03/17/111703) using "gpt-4o-2024-08-06"; because the evaluation conditions differ, its results should be treated as reference only.
| Model | LLM Size | Heron-Bench overall LLM (%) | JA-VLM-Bench-In-the-Wild LLM (/5.0) | JA-VG-VQA-500 LLM (/5.0) |
|--------------------------------|----------|------------------------------|-------------------------------------|--------------------------|
| **[Heron-NVILA-Lite-1B](https://huggingface.co/turing-motors/Heron-NVILA-Lite-1B)** | 0.5B | 45.9 | 2.92 | 3.16 |
| **Heron-NVILA-Lite-2B** | 1.5B | 52.8 | 3.52 | 3.50 |
| **[Heron-NVILA-Lite-15B](https://huggingface.co/turing-motors/Heron-NVILA-Lite-15B)** | 14B | 59.6 | 4.2 | 3.82 |
| [LLaVA-CALM2-SigLIP](https://huggingface.co/cyberagent/llava-calm2-siglip) | 7B | 43.3 | 3.15 | 3.21 |
| [Llama-3-EvoVLM-JP-v2](https://huggingface.co/SakanaAI/Llama-3-EvoVLM-JP-v2) | 8B | 39.3 | 2.92 | 2.96 |
| [VILA-jp](https://huggingface.co/llm-jp/llm-jp-3-vila-14b) | 13B | 57.2 | 3.69 | 3.62 |
| [Asagi-14B](https://huggingface.co/MIL-UT/Asagi-14B) | 13B | 55.8 | 3.44 | 3.84 |
| [Sarashina2-Vision-14B](https://huggingface.co/sbintuitions/sarashina2-vision-14b) | 13B | 50.9 | 4.1 | 3.43 |
| [Qwen2-VL 7B Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) | 7B | 55.5 | 3.61 | 3.6 |
| GPT-4o | - | 87.6 | 3.85 | 3.58 |
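
For reference, the LLM-as-a-judge scoring mentioned above roughly follows the pattern sketched below. The judging prompt, 1-5 scale, and response parsing are simplified placeholders of my own, not the exact llm-jp-eval-mm rubric.

```python
# Simplified, illustrative LLM-as-a-judge loop (not the llm-jp-eval-mm implementation).
from openai import OpenAI

client = OpenAI()

def judge(question, reference, prediction, judge_model="gpt-4o-2024-05-13"):
    prompt = (
        "Rate the answer to the question on a 1-5 scale against the reference.\n"
        f"Question: {question}\nReference: {reference}\nAnswer: {prediction}\n"
        "Reply with the number only."
    )
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return float(resp.choices[0].message.content.strip())
```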
## Risks and Limitations
This model is experimental and has not been thoroughly calibrated for ethical compliance or legal standards. Caution is advised for sensitive applications.
## License
- Model weights are licensed under [Apache License 2.0](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct/blob/main/LICENSE).
- Users must comply with [OpenAI terms of use](https://openai.com/policies/terms-of-use) due to the inclusion of GPT-4-generated synthetic data.
## Acknowledgements
This model is based on results obtained from a project, JPNP20017, subsidized by the New Energy and Industrial Technology Development Organization (NEDO).
I would like to acknowledge the use of the following open-source repositories:
- [VILA](https://github.com/NVlabs/VILA)
- [llm-jp-eval-mm](https://github.com/llm-jp/llm-jp-eval-mm)