---
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct/blob/main/LICENSE
language:
  - ja
  - en
tags:
  - vila
  - nvila
  - conversational
  - multimodal
base_model:
  - Qwen/Qwen2.5-1.5B-Instruct
  - Efficient-Large-Model/paligemma-siglip-so400m-patch14-448
---

Heron NVILA-Lite 2B

Heron NVILA-Lite 2B is a vision language model trained for Japanese, based on the NVILA-Lite architecture.

Model Overview

Setup

# I have confirmed that 4.46.0 and 4.49.0 also work. Other versions of Transformers may also work, but I have not tested them.
pip install transformers==4.45.0 accelerate opencv-python torchvision einops pillow
pip install git+https://github.com/bfshi/scaling_on_scales.git
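
To confirm the environment matches one of the tested configurations, a quick check from Python is enough (a minimal sketch; only the Transformers versions listed above have actually been verified):

# optional: verify the installed Transformers version and GPU availability
import transformers
import torch

print(transformers.__version__)   # tested: 4.45.0; 4.46.0 and 4.49.0 also confirmed
print(torch.cuda.is_available())  # a CUDA GPU is recommended for device_map="auto"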

Usage

from transformers import AutoConfig, AutoModel

model_path = "turing-motors/Heron-NVILA-Lite-2B"

# you can instantiate the model from its config
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_config(config, trust_remote_code=True, device_map="auto")

# or load the pretrained weights directly with from_pretrained
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map="auto")
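
# Optional (not from the model card): if GPU memory is tight, a dtype can be passed at load
# time. torch_dtype is a standard from_pretrained argument; whether the remote code applies
# it to every submodule is an assumption here.
import torch
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)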

# show chat_template
print(model.tokenizer.chat_template)
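
# Optional (not from the model card): generate_content presumably applies this template
# internally, but the standard tokenizer API can render the exact prompt string for
# inspection. The message format below is an assumption for illustration.
messages = [{"role": "user", "content": "こんにちは"}]  # "Hello"
prompt = model.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)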

# example: generate with plain text
response = model.generate_content(["こんにちは"])  # "Hello"
print(response)
print("---" * 40)

# example: generate with text + image
from PIL import Image
import requests
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
response = model.generate_content([image, "画像を説明してください。"])  # "Please describe the image."
print(response)
print("---" * 40)

# example: generate using a GenerationConfig
from PIL import Image
import requests
from transformers import GenerationConfig
generation_config = {
    "max_new_tokens": 512,
    "temperature": 0.5,
    "do_sample": True,
}
generation_config = GenerationConfig(**generation_config)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
response = model.generate_content(
    [image, "画像を説明してください。"],
    generation_config=generation_config
)
print(response)
print("---" * 40)

# example: generate with interleaved inputs (text + image + text + image + text)
from PIL import Image
import requests
url_list = [
    "https://images.unsplash.com/photo-1694831404826-3400c48c188d?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D",
    "https://images.unsplash.com/photo-1693240876439-473af88b4ed7?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
]
images = [
    Image.open(requests.get(url, stream=True).raw).convert("RGB") for url in url_list
]
response = model.generate_content([
    images[0],
    "これは日本の横断歩道の画像です",  # "This is an image of a pedestrian crossing in Japan."
    images[1],
    "これはオーストリアの信号機の画像です",  # "This is an image of traffic lights in Austria."
    "各画像に写っている歩行者用信号機の色は何色ですか?"])  # "What color is the pedestrian signal in each image?"
print(response)
print("---" * 40)

Training Summary

| Stage  | Training                       | Data Sources | Samples |
|--------|--------------------------------|--------------|---------|
| Stage1 | Projector                      | Japanese image text pairs, LLaVA-Pretrain | 1.1M |
| Stage2 | Projector, LLM                 | Filtered MOMIJI 3 snapshots (CC-MAIN-2024-46, CC-MAIN-2024-51, CC-MAIN-2025-05) | 13M |
|        |                                | Japanese image text pairs (subset), Japanese interleaved data (subset), mmc4-core (subset), coyo-700m (subset), wikipedia_ja, llava_pretrain_ja, stair_captions | 20M |
| Stage3 | Vision Encoder, Projector, LLM | llava-instruct-v1_5-en-subset-358k, llava-instruct-ja, japanese-photos-conv, ja-vg-vqa, synthdog-ja (subset), ai2d, synthdog-en, sherlock | 1.4M |

Evaluation

I used llm-jp-eval-mm for this evaluation. All scores other than those of our models are taken from the llm-jp-eval-mm leaderboard and the Asagi website.

| Model                | LLM Size | Heron-Bench overall LLM (%) | JA-VLM-Bench-In-the-Wild LLM (/5.0) | JA-VG-VQA-500 LLM (/5.0) |
|----------------------|----------|-----------------------------|-------------------------------------|--------------------------|
| Heron NVILA-Lite 2B  | 1.5B     | 52.8                        | 3.52                                | 3.50                     |
| Heron NVILA-Lite 15B | 14B      | 59.6                        | 4.2                                 | 3.82                     |
| LLaVA-CALM2-SigLIP   | 7B       | 43.3                        | 3.15                                | 3.21                     |
| Llama-3-EvoVLM-JP-v2 | 8B       | 39.3                        | 2.92                                | 2.96                     |
| VILA-jp              | 13B      | 57.2                        | 3.69                                | 3.62                     |
| Asagi-14B            | 13B      | 55.8                        | 3.44                                | 3.84                     |
| Qwen2-VL 7B Instruct | 7B       | 55.5                        | 3.61                                | 3.6                      |
| GPT-4o               | -        | 87.6                        | 3.85                                | 3.58                     |

Risks and Limitations

This model is experimental and has not been thoroughly calibrated for ethical compliance or legal standards. Caution is advised for sensitive applications.

License

How to cite

@misc{HeronNVILALite2B,
    title  = {Heron NVILA-Lite 2B},
    author = {Shingo Yokoi},
    year   = {2025},
    url    = {https://huggingface.co/turing-motors/Heron-NVILA-Lite-2B},
}

Citations

@misc{liu2025nvilaefficientfrontiervisual,
      title={NVILA: Efficient Frontier Visual Language Models},
      author={Zhijian Liu and Ligeng Zhu and Baifeng Shi and Zhuoyang Zhang and Yuming Lou and Shang Yang and Haocheng Xi and Shiyi Cao and Yuxian Gu and Dacheng Li and Xiuyu Li and Yunhao Fang and Yukang Chen and Cheng-Yu Hsieh and De-An Huang and An-Chieh Cheng and Vishwesh Nath and Jinyi Hu and Sifei Liu and Ranjay Krishna and Daguang Xu and Xiaolong Wang and Pavlo Molchanov and Jan Kautz and Hongxu Yin and Song Han and Yao Lu},
      year={2025},
      eprint={2412.04468},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.04468},
}