|
--- |
|
license: apache-2.0 |
|
license_link: https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct/blob/main/LICENSE |
|
language: |
|
- ja |
|
- en |
|
tags: |
|
- vila |
|
- nvila |
|
- conversational |
|
- multimodal |
|
base_model: |
|
- Qwen/Qwen2.5-1.5B-Instruct |
|
- Efficient-Large-Model/paligemma-siglip-so400m-patch14-448 |
|
--- |
|
# Heron NVILA-Lite 2B |
|
|
|
Heron NVILA-Lite 2B is a vision-language model trained for Japanese, built on the [NVILA](https://arxiv.org/abs/2412.04468)-Lite architecture.
|
|
|
## Model Overview |
|
|
|
* **Developed by**: [Turing Inc.](https://www.turing-motors.com/) |
|
* **Vision Encoder**: [paligemma-siglip-so400m-patch14-448](https://huggingface.co/Efficient-Large-Model/paligemma-siglip-so400m-patch14-448) |
|
* **Projector**: mlp_downsample_2x2_fix |
|
* **Language Model**: [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) |
|
* **Language(s)**: Japanese, English |
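
As a quick sanity check, the checkpoint's published config can be inspected to confirm which vision encoder, projector, and language model it wires together. This is a minimal sketch that only relies on the standard `AutoConfig` API also used in the Usage section below.

```python
from transformers import AutoConfig

# Load only the configuration (no weights) to inspect the declared components.
config = AutoConfig.from_pretrained(
    "turing-motors/Heron-NVILA-Lite-2B", trust_remote_code=True
)

# Printing the config shows the vision encoder, projector, and LLM settings
# stored with this checkpoint.
print(config)
```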
|
|
|
## Setup |
|
|
|
```bash |
|
# Transformers 4.46.0 and 4.49.0 have also been confirmed to work; other versions may work but are untested.
|
pip install transformers==4.45.0 accelerate opencv-python torchvision einops pillow |
|
pip install git+https://github.com/bfshi/scaling_on_scales.git |
|
``` |
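
Optionally, you can verify the environment before loading the model. The snippet below is only a sanity check using packages installed by the commands above.

```python
# Optional: confirm the installed package versions before loading the model.
import transformers
import torch
import torchvision
import cv2
import PIL

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("opencv-python:", cv2.__version__)
print("pillow:", PIL.__version__)
```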
|
|
|
## Usage |
|
|
|
```python |
|
from transformers import AutoConfig, AutoModel |
|
|
|
model_path = "turing-motors/Heron-NVILA-Lite-2B" |
|
|
|
# Option 1: build the model from its config
|
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True) |
|
model = AutoModel.from_config(config, trust_remote_code=True, device_map="auto") |
|
|
|
# Option 2: load the model directly with from_pretrained
|
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map="auto") |
|
|
|
# Show the tokenizer's chat template
|
print(model.tokenizer.chat_template) |
|
|
|
# Example: generation from plain text
|
response = model.generate_content(["こんにちは"]) |
|
print(response) |
|
print("---" * 40) |
|
|
|
# Example: generation from text + image
|
from PIL import Image |
|
import requests |
|
url = "http://images.cocodataset.org/val2017/000000039769.jpg" |
|
image = Image.open(requests.get(url, stream=True).raw).convert("RGB") |
|
response = model.generate_content([image, "画像を説明してください。"])  # The prompt means "Please describe the image."
|
print(response) |
|
print("---" * 40) |
|
|
|
# Example: generation with an explicit GenerationConfig
|
from transformers import GenerationConfig |
|
generation_config = { |
|
"max_new_tokens": 512, |
|
"temperature": 0.5, |
|
"do_sample": True, |
|
} |
|
generation_config = GenerationConfig(**generation_config) |
|
url = "http://images.cocodataset.org/val2017/000000039769.jpg" |
|
image = Image.open(requests.get(url, stream=True).raw).convert("RGB") |
|
response = model.generate_content( |
|
[image, "画像を説明してください。"], |
|
generation_config=generation_config |
|
) |
|
print(response) |
|
print("---" * 40) |
|
|
|
# Example: interleaved input (text + image + text + image + text)
|
url_list = [ |
|
"https://images.unsplash.com/photo-1694831404826-3400c48c188d?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D", |
|
"https://images.unsplash.com/photo-1693240876439-473af88b4ed7?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" |
|
] |
|
images = [ |
|
Image.open(requests.get(url, stream=True).raw).convert("RGB") for url in url_list |
|
] |
|
response = model.generate_content([ |
|
images[0], |
|
"これは日本の横断歩道の画像です", |
|
images[1], |
|
"これはオーストリアの信号機の画像です", |
|
"各画像に写っている歩行者用信号機の色は何色ですか?"]) |
|
print(response) |
|
print("---" * 40) |
|
``` |
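
The same interleaved-list interface also works with local image files. The sketch below reuses the `model` loaded above and only calls already shown in this section; `sample.jpg` is a placeholder path, and greedy decoding is chosen purely to make the output reproducible.

```python
from PIL import Image
from transformers import GenerationConfig

# Hypothetical local file; replace "sample.jpg" with an image on your machine.
local_image = Image.open("sample.jpg").convert("RGB")

# Deterministic decoding for reproducible output.
greedy_config = GenerationConfig(max_new_tokens=256, do_sample=False)

response = model.generate_content(
    [local_image, "画像を説明してください。"],  # "Please describe the image."
    generation_config=greedy_config,
)
print(response)
```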
|
|
|
## Training Summary |
|
|
|
| Stage | Trained Components | Data Sources | Samples |
|
|--------|-------------------------------|-------------------------------|-------------| |
|
| Stage1 | Projector | [Japanese image text pairs](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-japanese-image-text-pairs), [LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) | 1.1M | |
|
| Stage2 | Projector, LLM | Filtered MOMIJI 3 snapshots (CC-MAIN-2024-46, CC-MAIN-2024-51, CC-MAIN-2025-05) | 13M | |
|
| | | [Japanese image text pairs (subset)](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-japanese-image-text-pairs), [Japanese interleaved data (subset)](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-japanese-interleaved-data), [mmc4-core (subset)](https://github.com/allenai/mmc4), [coyo-700m (subset)](https://huggingface.co/datasets/kakaobrain/coyo-700m), [wikipedia_ja](https://huggingface.co/datasets/turing-motors/Wikipedia-Vision-JA), [llava_pretrain_ja](https://huggingface.co/datasets/turing-motors/LLaVA-Pretrain-JA), [stair_captions](http://captions.stair.center/) | 20M | |
|
| Stage3 | Vision Encoder, Projector, LLM | [llava-instruct-v1_5-en-subset-358k](https://huggingface.co/datasets/llm-jp/llava-instruct-v1_5-en-subset-358k), [llava-instruct-ja](https://huggingface.co/datasets/llm-jp/llava-instruct-ja), [japanese-photos-conv](https://huggingface.co/datasets/llm-jp/japanese-photos-conversation), [ja-vg-vqa](https://huggingface.co/datasets/llm-jp/ja-vg-vqa-conversation), [synthdog-ja (subset)](https://huggingface.co/datasets/naver-clova-ix/synthdog-ja), [ai2d](https://huggingface.co/datasets/lmms-lab/ai2d), [synthdog-en](https://huggingface.co/datasets/naver-clova-ix/synthdog-en), [sherlock](https://github.com/allenai/sherlock) | 1.4M | |
|
|
|
## Evaluation |
|
We evaluated the model with [llm-jp-eval-mm](https://github.com/llm-jp/llm-jp-eval-mm). All scores other than those of our models are taken from the [llm-jp-eval-mm leaderboard](https://llm-jp.github.io/llm-jp-eval-mm/) and the [Asagi website](https://uehara-mech.github.io/asagi-vlm?v=1).
|
|
|
| Model | LLM Size | Heron-Bench overall LLM (%) | JA-VLM-Bench-In-the-Wild LLM (/5.0) | JA-VG-VQA-500 LLM (/5.0) | |
|
|--------------------------------|---------|-----------------------------|-------------------------------------|--------------------------| |
|
| **Heron NVILA-Lite 2B** | 1.5B | 52.8 | 3.52 | 3.50 | |
|
| **Heron NVILA-Lite 15B** | 14B | 59.6 | 4.2 | 3.82 | |
|
| [LLaVA-CALM2-SigLIP](https://huggingface.co/cyberagent/llava-calm2-siglip) | 7B | 43.3 | 3.15 | 3.21 | |
|
| [Llama-3-EvoVLM-JP-v2](https://huggingface.co/SakanaAI/Llama-3-EvoVLM-JP-v2) | 8B | 39.3 | 2.92 | 2.96 | |
|
| [VILA-jp](https://huggingface.co/llm-jp/llm-jp-3-vila-14b) | 13B | 57.2 | 3.69 | 3.62 | |
|
| [Asagi-14B](https://huggingface.co/MIL-UT/Asagi-14B) | 13B | 55.8 | 3.44 | 3.84 | |
|
| [Qwen2-VL 7B Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) | 7B | 55.5 | 3.61 | 3.6 | |
|
| GPT-4o | - | 87.6 | 3.85 | 3.58 | |
|
|
|
## Risks and Limitations |
|
|
|
This model is experimental and has not been thoroughly calibrated for ethical compliance or legal standards. Caution is advised for sensitive applications. |
|
|
|
## License |
|
|
|
- Model weights are licensed under [Apache License 2.0](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct/blob/main/LICENSE). |
|
- Users must comply with [OpenAI terms of use](https://openai.com/policies/terms-of-use) due to the inclusion of GPT-4-generated synthetic data. |
|
|
|
## How to cite |
|
|
|
```bibtex |
|
@misc{HeronNVILALite2B, |
|
title = {Heron NVILA-Lite 2B}, |
|
author = {Shingo Yokoi}, |
|
year = {2025}, |
|
url = {https://huggingface.co/turing-motors/Heron-NVILA-Lite-2B}, |
|
} |
|
``` |
|
|
|
## Citations |
|
|
|
```bibtex |
|
@misc{liu2025nvilaefficientfrontiervisual, |
|
title={NVILA: Efficient Frontier Visual Language Models}, |
|
author={Zhijian Liu and Ligeng Zhu and Baifeng Shi and Zhuoyang Zhang and Yuming Lou and Shang Yang and Haocheng Xi and Shiyi Cao and Yuxian Gu and Dacheng Li and Xiuyu Li and Yunhao Fang and Yukang Chen and Cheng-Yu Hsieh and De-An Huang and An-Chieh Cheng and Vishwesh Nath and Jinyi Hu and Sifei Liu and Ranjay Krishna and Daguang Xu and Xiaolong Wang and Pavlo Molchanov and Jan Kautz and Hongxu Yin and Song Han and Yao Lu}, |
|
year={2025}, |
|
eprint={2412.04468}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CV}, |
|
url={https://arxiv.org/abs/2412.04468}, |
|
} |
|
``` |