---
license: apache-2.0
language:
  - en
library_name: transformers
tags:
  - pytorch
  - safetensors
  - vision-language
  - visual-question-answering
pipeline_tag: visual-question-answering
base_model:
  - Qwen/Qwen3-0.6B-Base
model-index:
  - name: MicroLLaVA-Qwen3-0.6B-base-siglip2-so400m
    results:
      - task:
          type: visual-question-answering
          name: VQAv2
        dataset:
          name: VQAv2
          type: vqav2
        metrics:
          - name: Overall Accuracy
            type: accuracy
            value: 78.52
          - name: Yes/No Accuracy
            type: accuracy
            value: 91.56
          - name: Number Accuracy
            type: accuracy
            value: 65.69
          - name: Other Accuracy
            type: accuracy
            value: 70.28
        source:
          name: Internal Evaluation on VQAv2 test-dev
          url: https://visualqa.org/download.html
---

# 🦙📷 MicroLLaVA-Qwen3-0.6B-base-siglip2-so400m

A compact yet competitive open-source vision-language model trained from scratch on a single RTX 4090.

This ~1B parameter model performs on par with the original LLaVA-1.5-7B, making it a serious candidate for visual question answering, both for everyday users and for researchers focused on efficient multimodal architectures.

It is also well suited for edge AI applications such as on-device visual question answering, thanks to its small size and fast inference.


## 📌 Model Summary

`keeeeenw/MicroLLaVA-Qwen3-0.6B-base-siglip2-so400m` combines the strengths of:

- Qwen/Qwen3-0.6B-Base as the language model
- SigLIP2 so400m as the vision encoder

Trained with a modified fork of TinyLLaVA Factory, this model reaches 78.52% overall accuracy on VQAv2 test-dev:

- 🥈 outperforming Zhang199/TinyLLaVA-Qwen2-0.5B-SigLIP (72.33%)
- 🥉 matching LLaVA-1.5-7B (~78.5%)

See https://huggingface.co/Zhang199/TinyLLaVA-Qwen2-0.5B-SigLIP#result for the evaluation results of the original models reported by the TinyLLaVA and LLaVA authors.


πŸ–ΌοΈ Example Inference

Code:

```python
# Tested with transformers 4.55.2, torch 2.8.0, and torchvision 0.23.0.
# Older versions of transformers do not support the Qwen3 architecture.
# pip install transformers torch torchvision

from transformers import AutoTokenizer, AutoModelForCausalLM

hf_path = 'keeeeenw/MicroLlava-Qwen3-0.6B-base-siglip2-so400m'

# trust_remote_code is required because inference uses custom model code shipped with the repo.
model = AutoModelForCausalLM.from_pretrained(hf_path, trust_remote_code=True)
# model.cuda()  # Optional. On CPU, inference takes ~78 seconds on my AMD 5950.
config = model.config
tokenizer = AutoTokenizer.from_pretrained(
    hf_path,
    use_fast=False,
    model_max_length=config.tokenizer_model_max_length,
    padding_side=config.tokenizer_padding_side,
)

prompt = "What are the things I should be cautious about when I visit here?"
image_url = "https://llava-vl.github.io/static/images/view.jpg"
output_text, generation_time = model.chat(prompt=prompt,
                                          image=image_url,
                                          tokenizer=tokenizer)

print('model output:', output_text)
print('running time:', generation_time)
```

Prompt:
🧠 "What are the things I should be cautious about when I visit here?"

Image:
![example image](https://llava-vl.github.io/static/images/view.jpg)

Model Output: When visiting this wooden pier on the lake, there are a few things to be cautious about. First, be aware of the water depth, as the water may be shallow or choppy, which could pose a risk of falling into the water or getting caught in a strong current. Second, be mindful of the weather conditions, as the image shows a cloudy day, which might indicate the possibility of rain or strong winds. These conditions can make the experience more challenging and potentially dangerous. Lastly, be cautious about the presence of any wildlife in the area, as the lake and its surroundings may attract various animals. Always follow safety guidelines and be respectful of the natural environment when visiting this picturesque location.


βš™οΈ Training Details

- 🧠 Total parameters: ~1B
- 🖥️ Hardware: single NVIDIA RTX 4090 (24 GB VRAM)
- ⏱️ Total training time: ~24 hours
  - Stage 1 (pretraining): ~8 hours
  - Stage 2 (fine-tuning): ~12 hours
- 🧾 Training method: follows the official TinyLLaVA instructions (see the sketch after this list)
  - ⚠️ ocr_vqa was omitted during fine-tuning due to dataset access issues (to be re-added in the next training run)
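
For readers new to the LLaVA-style recipe that TinyLLaVA Factory follows, the sketch below illustrates which components are typically trainable in each stage: stage 1 usually trains only the connector that projects vision features into the language model's embedding space, while stage 2 also fine-tunes the language model. The `ToyVLM` and `set_trainable` names are stand-ins invented for this illustration, not TinyLLaVA code; refer to the TinyLLaVA Factory instructions for the actual training scripts.

```python
# Illustrative sketch of the LLaVA-style two-stage recipe (not the actual TinyLLaVA code).
import torch.nn as nn

class ToyVLM(nn.Module):
    """Minimal stand-in: vision encoder + connector + language model."""
    def __init__(self, vision_dim: int = 64, llm_dim: int = 128):
        super().__init__()
        self.vision_tower = nn.Linear(vision_dim, vision_dim)  # stands in for SigLIP2 so400m
        self.connector = nn.Linear(vision_dim, llm_dim)        # projects image features into the LLM space
        self.language_model = nn.Linear(llm_dim, llm_dim)      # stands in for Qwen3-0.6B-Base

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

model = ToyVLM()

# Stage 1 (pretraining): train only the connector on image-caption pairs,
# keeping the vision encoder and the language model frozen.
set_trainable(model.vision_tower, False)
set_trainable(model.language_model, False)
set_trainable(model.connector, True)

# Stage 2 (fine-tuning): unfreeze the language model (and keep the connector trainable)
# for visual instruction tuning.
set_trainable(model.language_model, True)
set_trainable(model.connector, True)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters after the stage 2 setup: {trainable}")
```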

## ✨ Key Contributions

1. ✅ Validated Qwen3-0.6B + SigLIP2 as an efficient, high-performance combination for vision-language tasks
2. 🛠️ Upgraded TinyLLaVA to support Qwen3 models and the latest PyTorch/transformers releases
3. 🧩 Created a new Qwen3 chat template → qwen3_base_template.py (an illustrative sketch of the format follows this list)
4. ⚙️ Performed hyperparameter tuning for optimal Qwen3 + SigLIP2 performance
5. 🚀 Released standalone Hugging Face inference support, so there is no need to install TinyLLaVA
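
As context for contribution 3: Qwen chat models conventionally use ChatML-style `<|im_start|>` / `<|im_end|>` turn markers, and LLaVA-family models mark the image position with an `<image>` placeholder. The snippet below is only an illustrative sketch of that prompt format (the `build_chat_prompt` helper is invented here); the template actually used by this model is defined in `qwen3_base_template.py` and may differ in its details.

```python
# Illustrative ChatML-style prompt formatting; NOT a copy of qwen3_base_template.py.

def build_chat_prompt(system: str, user_message: str) -> str:
    """Format a single-turn conversation with ChatML-style turn markers."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(build_chat_prompt(
    "You are a helpful vision-language assistant.",
    "<image>\nWhat are the things I should be cautious about when I visit here?",
))
```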

## 📊 VQAv2 Evaluation (test-dev)

| Question Type | Accuracy |
|---------------|----------|
| Yes/No        | 91.56%   |
| Number        | 65.69%   |
| Other         | 70.28%   |
| Overall       | 78.52%   |

Note: Evaluation was performed on the VQAv2 test-dev set. It is unclear whether the prior models listed above were evaluated on test-dev or the full test set. This model is currently being evaluated on the full test set and on additional benchmarks.
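
For reference, the VQAv2 metric scores each predicted answer against the ten human answers collected per question: an answer counts as fully correct when at least three annotators gave it. Below is a minimal sketch of that scoring rule (the `vqa_accuracy` helper is ours); the official evaluation code additionally normalizes answers and averages over annotator subsets.

```python
# Minimal sketch of the standard VQA accuracy rule: min(#matching human answers / 3, 1).

def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    matches = sum(1 for a in human_answers if a.strip().lower() == predicted.strip().lower())
    return min(matches / 3.0, 1.0)

# Example: ten human answers for one question.
humans = ["blue", "blue", "blue", "light blue", "blue", "blue", "teal", "blue", "blue", "blue"]
print(vqa_accuracy("blue", humans))        # 1.0
print(vqa_accuracy("light blue", humans))  # ~0.33
print(vqa_accuracy("red", humans))         # 0.0
```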


## 🔜 Upcoming Work

This model is currently undergoing evaluation on:

- VQAv2 test set
- GQA
- SQA
- TextVQA
- MM-VET
- POPE
- MME
- MMMU

Stay tuned for updates!


## 🧾 Citation

If you find this model helpful, please consider citing or referencing this repo:

```bibtex
@misc{wang2024microllama,
  title        = {MicroLLaVA: a TinyLLaVA based VLM with MicroLlama 300M for single GPU training},
  author       = {Zixiao Ken Wang},
  year         = {2025},
  url          = {https://huggingface.co/keeeeenw/MicroLlava, https://huggingface.co/keeeeenw/MicroLlava-Qwen3-0.6B-base-siglip2-so400m}
}
```

Please also support my earlier release, https://huggingface.co/keeeeenw/MicroLlava, which is built on my own https://huggingface.co/keeeeenw/MicroLlama for its language capabilities.