|
--- |
|
license: apache-2.0 |
|
license_link: https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct/blob/main/LICENSE |
|
language: |
|
- ja |
|
- en |
|
tags: |
|
- vila |
|
- nvila |
|
- conversational |
|
- multimodal |
|
base_model: |
|
- Qwen/Qwen2.5-1.5B-Instruct |
|
- Efficient-Large-Model/paligemma-siglip-so400m-patch14-448 |
|
pipeline_tag: image-text-to-text |
|
--- |
|
# Heron-NVILA-Lite-2B |
|
|
|
Heron-NVILA-Lite-2B is a vision language model trained for Japanese, based on the [NVILA](https://arxiv.org/abs/2412.04468)-Lite architecture. |
|
|
|
## Model Overview |
|
|
|
* **Developer**: [Turing Inc.](https://www.turing-motors.com/) |
|
* **Vision Encoder**: [paligemma-siglip-so400m-patch14-448](https://huggingface.co/Efficient-Large-Model/paligemma-siglip-so400m-patch14-448) |
|
* **Projector**: mlp_downsample_2x2_fix |
|
* **LLM**: [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) |
|
* **Supported Languages**: Japanese, English |
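

These components are wired together by the NVILA remote code and can be inspected once the model is loaded (dependencies and loading are covered under Setup and Usage below). The snippet below is a minimal sketch using generic PyTorch introspection; the exact submodule names are determined by the remote-code implementation and should be treated as implementation details, not a stable API.

```python
from transformers import AutoModel

# Load the model as shown in the Usage section below (requires the Setup dependencies).
model = AutoModel.from_pretrained(
    "turing-motors/Heron-NVILA-Lite-2B", trust_remote_code=True, device_map="auto"
)

# List the top-level submodules (vision encoder, projector, LLM) as exposed by the
# remote code; the names printed here depend on that implementation.
for name, module in model.named_children():
    print(f"{name}: {module.__class__.__name__}")
```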
|
|
|
## Setup |
|
|
|
```bash |
|
# Transformers 4.45.0 is pinned below; 4.46.0 and 4.49.0 have also been confirmed to work. Other versions may work as well but are untested.
|
pip install transformers==4.45.0 accelerate opencv-python torchvision einops pillow |
|
pip install git+https://github.com/bfshi/scaling_on_scales.git |
|
``` |
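

Optionally, the following quick check confirms that the packages installed above import cleanly and that a GPU is visible; this is a minimal sketch, assuming a CUDA-capable environment is the target.

```python
# Sanity check for the environment installed above.
import torch
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```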
|
|
|
## Usage |
|
|
|
```python |
|
from transformers import AutoConfig, AutoModel |
|
|
|
model_path = "turing-motors/Heron-NVILA-Lite-2B" |
|
|
|
# Option 1: instantiate the model via its config
|
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True) |
|
model = AutoModel.from_config(config, trust_remote_code=True, device_map="auto") |
|
|
|
# Option 2: load directly with from_pretrained
|
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map="auto") |
|
|
|
# Show the chat template
|
print(model.tokenizer.chat_template) |
|
|
|
# Example: generate from plain text
|
response = model.generate_content(["こんにちは"])
|
print(response) |
|
print("---" * 40) |
|
|
|
# Example: generate from text + an image
|
from PIL import Image |
|
import requests |
|
url = "http://images.cocodataset.org/val2017/000000039769.jpg" |
|
image = Image.open(requests.get(url, stream=True).raw).convert("RGB") |
|
response = model.generate_content([image, "画像を説明してください。"])
|
print(response) |
|
print("---" * 40) |
|
|
|
# Example: generate with a custom GenerationConfig
|
from PIL import Image |
|
import requests |
|
from transformers import GenerationConfig |
|
generation_config = { |
|
"max_new_tokens": 512, |
|
"temperature": 0.5, |
|
"do_sample": True, |
|
} |
|
generation_config = GenerationConfig(**generation_config) |
|
url = "http://images.cocodataset.org/val2017/000000039769.jpg" |
|
image = Image.open(requests.get(url, stream=True).raw).convert("RGB") |
|
response = model.generate_content( |
|
[image, "画像を説明してください。"],
|
generation_config=generation_config |
|
) |
|
print(response) |
|
print("---" * 40) |
|
|
|
# Example: generate from interleaved inputs (text + image + text + image + text)
|
from PIL import Image |
|
import requests |
|
url_list = [ |
|
"https://images.unsplash.com/photo-1694831404826-3400c48c188d?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D", |
|
"https://images.unsplash.com/photo-1693240876439-473af88b4ed7?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" |
|
] |
|
images = [ |
|
Image.open(requests.get(url, stream=True).raw).convert("RGB") for url in url_list |
|
] |
|
response = model.generate_content([ |
|
images[0], |
|
"ใใใฏๆฅๆฌใฎ็ปๅใงใ", |
|
images[1], |
|
"ใใใฏใชใผในใใชใขใฎ็ปๅใงใ", |
|
"ๅ็ปๅใฎ้ใใ่ชฌๆใใฆ"]) |
|
print(response) |
|
print("---" * 40) |
|
``` |
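

`generate_content` works the same way with local image files. Below is a minimal sketch of a small convenience wrapper; the helper name and default prompt are ours, not part of the model's API, and it assumes `model` has been loaded as shown above.

```python
from PIL import Image


def describe_image(model, image_path: str, prompt: str = "画像を説明してください。") -> str:
    """Hypothetical helper: describe a local image file with Heron-NVILA-Lite-2B."""
    image = Image.open(image_path).convert("RGB")
    return model.generate_content([image, prompt])


# Example call (the path is illustrative):
# print(describe_image(model, "sample.jpg"))
```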
|
|
|
## Training Summary |
|
|
|
| Stage | Training | Data Sources | Samples | |
|
|--------|-------------------------------|-------------------------------|-------------| |
|
| Stage1 | Projector | [Japanese image text pairs](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-japanese-image-text-pairs), [LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) | 1.1M | |
|
| Stage2 | Projector, LLM | Filtered [MOMIJI](https://huggingface.co/datasets/turing-motors/MOMIJI) (CC-MAIN-2024-46, CC-MAIN-2024-51, CC-MAIN-2025-05) | 13M | |
|
| | | [Japanese image text pairs (subset)](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-japanese-image-text-pairs), [Japanese interleaved data (subset)](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-japanese-interleaved-data), [mmc4-core (subset)](https://github.com/allenai/mmc4), [coyo-700m (subset)](https://huggingface.co/datasets/kakaobrain/coyo-700m), [wikipedia_ja](https://huggingface.co/datasets/turing-motors/Wikipedia-Vision-JA), [llava_pretrain_ja](https://huggingface.co/datasets/turing-motors/LLaVA-Pretrain-JA), [stair_captions](http://captions.stair.center/) | 20M | |
|
| Stage3 | Vision Encoder, Projector, LLM | [llava-instruct-v1_5-en-subset-358k](https://huggingface.co/datasets/llm-jp/llava-instruct-v1_5-en-subset-358k), [llava-instruct-ja](https://huggingface.co/datasets/llm-jp/llava-instruct-ja), [japanese-photos-conv](https://huggingface.co/datasets/llm-jp/japanese-photos-conversation), [ja-vg-vqa](https://huggingface.co/datasets/llm-jp/ja-vg-vqa-conversation), [synthdog-ja (subset)](https://huggingface.co/datasets/naver-clova-ix/synthdog-ja), [ai2d](https://huggingface.co/datasets/lmms-lab/ai2d), [synthdog-en](https://huggingface.co/datasets/naver-clova-ix/synthdog-en), [sherlock](https://github.com/allenai/sherlock) | 1.1M | |
|
|
|
## Evaluation |
|
|
|
We used [llm-jp-eval-mm](https://github.com/llm-jp/llm-jp-eval-mm) for this evaluation. Scores for models other than Heron-NVILA-Lite and Sarashina2-Vision-14B were taken from the [llm-jp-eval-mm leaderboard](https://llm-jp.github.io/llm-jp-eval-mm/) (as of March 2025) and the [Asagi website](https://uehara-mech.github.io/asagi-vlm?v=1). Heron-NVILA-Lite and Sarashina2-Vision-14B were evaluated with an LLM-as-a-judge setup using "gpt-4o-2024-05-13", whereas Sarashina2-Vision-14B was evaluated on its [official blog](https://www.sbintuitions.co.jp/blog/entry/2025/03/17/111703) using "gpt-4o-2024-08-06"; because the evaluation conditions differ, the Sarashina2-Vision-14B results should be treated as reference only.
|
|
|
| Model | LLM Size | Heron-Bench overall LLM (%) | JA-VLM-Bench-In-the-Wild LLM (/5.0) | JA-VG-VQA-500 LLM (/5.0) | |
|
|--------------------------------|----------|------------------------------|-------------------------------------|--------------------------| |
|
| **[Heron-NVILA-Lite-1B](https://huggingface.co/turing-motors/Heron-NVILA-Lite-1B)** | 0.5B | 45.9 | 2.92 | 3.16 | |
|
| **Heron-NVILA-Lite-2B** | 1.5B | 52.8 | 3.52 | 3.50 | |
|
| **[Heron-NVILA-Lite-15B](https://huggingface.co/turing-motors/Heron-NVILA-Lite-15B)** | 14B | 59.6 | 4.2 | 3.82 | |
|
| [LLaVA-CALM2-SigLIP](https://huggingface.co/cyberagent/llava-calm2-siglip) | 7B | 43.3 | 3.15 | 3.21 | |
|
| [Llama-3-EvoVLM-JP-v2](https://huggingface.co/SakanaAI/Llama-3-EvoVLM-JP-v2) | 8B | 39.3 | 2.92 | 2.96 | |
|
| [VILA-jp](https://huggingface.co/llm-jp/llm-jp-3-vila-14b) | 13B | 57.2 | 3.69 | 3.62 | |
|
| [Asagi-14B](https://huggingface.co/MIL-UT/Asagi-14B) | 13B | 55.8 | 3.44 | 3.84 | |
|
| [Sarashina2-Vision-14B](https://huggingface.co/sbintuitions/sarashina2-vision-14b) | 13B | 50.9 | 4.1 | 3.43 | |
|
| [Qwen2-VL 7B Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) | 7B | 55.5 | 3.61 | 3.6 | |
|
| GPT-4o | - | 87.6 | 3.85 | 3.58 | |
|
|
|
## Risks and Limitations |
|
|
|
This model is experimental and has not been thoroughly calibrated for ethical compliance or legal standards. Caution is advised for sensitive applications. |
|
|
|
## License |
|
|
|
- Model weights are licensed under [Apache License 2.0](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct/blob/main/LICENSE). |
|
- Users must comply with [OpenAI terms of use](https://openai.com/policies/terms-of-use) due to the inclusion of GPT-4-generated synthetic data. |
|
|
|
## Acknowledgements |
|
|
|
This model is based on results obtained from a project, JPNP20017, subsidized by the New Energy and Industrial Technology Development Organization (NEDO). |
|
|
|
We would like to acknowledge the use of the following open-source repositories:
|
|
|
- [VILA](https://github.com/NVlabs/VILA) |
|
- [llm-jp-eval-mm](https://github.com/llm-jp/llm-jp-eval-mm) |