---
language:
- en
library_name: transformers
tags:
- pytorch
- safetensors
- vision-language
- visual-question-answering
pipeline_tag: visual-question-answering
license: apache-2.0
base_model:
- google/siglip2-so400m-patch14-384
- keeeeenw/MicroLlama
model-index:
- name: MicroLLaVA (MicroLLaMA 300M + SigLIP2-so400m-patch14-384)
  results:
  - task:
      type: visual-question-answering
      name: VQAv2
    dataset:
      name: VQAv2
      type: vqav2
    metrics:
    - name: Overall Accuracy
      type: accuracy
      value: 56.91
    - name: Yes/No Accuracy
      type: accuracy
      value: 72.32
    - name: Number Accuracy
      type: accuracy
      value: 43.89
    - name: Other Accuracy
      type: accuracy
      value: 46.65
    source:
      name: Internal Evaluation on VQAv2 test-dev
      url: https://visualqa.org/download.html
---

# MicroLLaVA

A compact vision language model that you can pretrain and finetune on a single consumer GPU.

## 🔍 Performance & Training Highlights

- 📊 **VQAv2 Accuracy**: Achieves **56.91%** on VQAv2 test-dev, making MicroLLaVA one of the best-performing open-source language models with vision capabilities at or under **~700M parameters**.
- 🧠 **Parameter Budget**:
  - 🗣️ Language Model: **MicroLLaMA (300M)**
  - 👁️ Vision Encoder: **SigLIP2 (400M)**

  → **~700M total parameters**
- 🏆 **Best in Class**: According to ChatGPT’s Deep Research Agent (Aug 2025):
  > *“No known open model below ~700M currently surpasses MicroLLaVA’s VQAv2 accuracy. Models that do perform better tend to have larger language components.”*
- 🧪 **Ongoing Experiments**:
  - 🔧 **Qwen3-0.6B + SigLIP2** → Training is **converging**, showing promising loss curves. (Qwen3-0.6B is significantly larger than MicroLLaMA.)
  - ❌ **Gemma-3-270M-IT + SigLIP2** → Training **did not converge**, likely due to instability, bugs, or poor alignment under current hyperparameters.

## 📰 News and Updates

* 08/17/2025: this Hugging Face repo was renamed to https://huggingface.co/keeeeenw/MicroLlava.
* 08/17/2025: improved the overall **VQAv2** test-dev accuracy from **44.01%** to **56.91%** by upgrading the vision tower from SigLIP to SigLIP2.
* 08/09/2025: initial version of MicroLLaVA released.

## 🎯 TLDR

| Item            | Detail |
|-----------------|--------|
| Framework       | Transformers + PyTorch |
| Checkpoint type | `safetensors` |
| LLM             | [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) (about 300M parameters) |
| Vision tower    | [`siglip2-so400m-patch14-384`](https://huggingface.co/google/siglip2-so400m-patch14-384) |
| Hardware used   | Single NVIDIA RTX 4090 |
| Training stack  | No DeepSpeed required |
| Intended tasks  | Visual Question Answering, caption-style prompts |

---

## 📋 Introduction

MicroLLaVA is a [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory) based model that pairs a very small language model, [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama), with an efficient SigLIP2 vision encoder. The goal is to create a vision language model that almost anyone can train and iterate on with one consumer GPU.

- **Language model**: [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) with ~300M parameters
- **Vision encoder**: [`siglip2-so400m-patch14-384`](https://huggingface.co/google/siglip2-so400m-patch14-384)
- **Training codebase**: [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory) with additional changes in my fork: [Custom fork with training tweaks](https://github.com/keeeeenw/TinyLLaVA_Factory)

Because of its compact size, this model can be trained entirely on a single NVIDIA RTX 4090 without DeepSpeed. Pretraining on **LAION-CC-SBU-558K** took about **5 hours**, and supervised finetuning on all datasets from the TinyLLaVA Factory guide (except `ocr_vqa`) took about **12 hours** on the same GPU.
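As a rough illustration of why a ~700M-parameter model fits on one 24 GB card, the sketch below estimates memory for weights and optimizer state only. The bytes-per-parameter figures are assumptions (fp16 weights, and a common mixed-precision Adam layout of 16 bytes per trainable parameter), not measurements from the actual training runs. The sketch ignores activations, the connector module, and any frozen components, so treat the result as a ballpark.

```python
# Back-of-envelope memory estimate; all byte counts are assumptions, not measurements.
GIB = 1024 ** 3
RTX_4090_GIB = 24  # nominal memory of an RTX 4090

def memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold `num_params` values at `bytes_per_param` each, in GiB."""
    return num_params * bytes_per_param / GIB

# Parameter counts as stated in this model card (approximate).
llm_params = 300e6      # MicroLlama
vision_params = 400e6   # SigLIP2 so400m
total_params = llm_params + vision_params

# Inference-style footprint: fp16 weights only (2 bytes per parameter).
weights_gib = memory_gib(total_params, 2)

# Pessimistic training footprint if every parameter were trained with
# mixed-precision Adam: fp16 weights (2) + fp16 grads (2) + fp32 master
# weights (4) + two fp32 Adam moments (8) = 16 bytes per parameter.
training_state_gib = memory_gib(total_params, 16)

print(f"fp16 weights:        {weights_gib:.2f} GiB")
print(f"weights + optimizer: {training_state_gib:.2f} GiB of {RTX_4090_GIB} GiB")
```

Even under this pessimistic assumption the optimizer state stays well below the card's capacity, leaving the remainder for activations and batch data.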
---

## 🚀 Quick start

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

repo_id = "keeeeenw/MicroLlava"

tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Load the processor only if the repo ships a processor config.
try:
    processor = AutoProcessor.from_pretrained(repo_id)
except Exception:
    processor = None  # Optional if images are preprocessed manually

model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,  # Required when the repo includes custom model code
)

# Text-only smoke test: no image is passed to the model in this snippet.
inputs = tokenizer("Describe the image in one sentence.", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

## 🏆 Evaluation

### VQAv2 Evaluation Results (MicroLlama 300M + SigLIP2-so400m-patch14-384)

| Question Type | Accuracy |
|---------------|----------|
| Yes/No        | 72.32%   |
| Number        | 43.89%   |
| Other         | 46.65%   |
| **Overall**   | **56.91%** |

*Evaluated on VQAv2 test-dev split*

### (Previous version) VQAv2 Evaluation Results (MicroLlama 300M + SigLIP-so400m-patch14-384)

| Question Type | Accuracy |
|---------------|----------|
| Yes/No        | 65.08%   |
| Number        | 28.97%   |
| Other         | 29.32%   |
| **Overall**   | **44.01%** |

*Evaluated on VQAv2 test-dev split*

More evaluation results will be added in the coming days. Community contributions with benchmark results are welcome and encouraged.
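For readers who want to sanity-check predictions locally before submitting to the VQAv2 evaluation server, the sketch below shows a simplified form of the VQA soft-accuracy metric, in which an answer counts as fully correct when at least three human annotators gave it. This is an illustration, not the script used to produce the numbers above. The official evaluator also normalizes punctuation, articles, and number words, and averages over annotator subsets, so scores from this simplified version may differ slightly. The function names and toy data are hypothetical.

```python
from typing import Sequence

def vqa_soft_accuracy(prediction: str, human_answers: Sequence[str]) -> float:
    """Simplified VQA accuracy: min(number of matching human answers / 3, 1)."""
    pred = prediction.strip().lower()
    matches = sum(1 for answer in human_answers if answer.strip().lower() == pred)
    return min(matches / 3.0, 1.0)

def mean_accuracy_percent(
    predictions: Sequence[str], references: Sequence[Sequence[str]]
) -> float:
    """Average the per-question soft accuracy and express it as a percentage."""
    scores = [vqa_soft_accuracy(p, r) for p, r in zip(predictions, references)]
    return 100.0 * sum(scores) / len(scores)

# Toy data (made up for illustration): 10 human answers per question.
predictions = ["yes", "2"]
references = [
    ["yes"] * 8 + ["no"] * 2,   # 8 annotators agree with the prediction
    ["2"] * 2 + ["3"] * 8,      # only 2 annotators agree with the prediction
]

score = mean_accuracy_percent(predictions, references)
print(f"Toy accuracy: {score:.2f}%")
```

Per-type scores such as Yes/No, Number, and Other can be obtained by running the same averaging over each question-type subset separately.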
---

## ✅ Intended uses and limitations

**Intended uses**

- Rapid experimentation for vision-language research on limited hardware
- Educational demonstrations for students and hobbyists
- Starting point for domain-specific finetuning

**Limitations**

- The small LLM size and compact vision encoder may limit reasoning depth and OCR performance
- Performance can vary significantly depending on the image domain and quality
- The model includes minimal safety filtering and refusal behavior; downstream applications should implement their own safeguards

> ⚠️ This model should not be used for applications that may cause harm or have significant safety, financial, legal, or medical implications without thorough human review.

---

## 📝 Citation

```bibtex
@misc{wang2024microllama,
  title  = {MicroLLaVA: a TinyLLaVA based VLM with MicroLlama 300M for single GPU training},
  author = {Zixiao Ken Wang},
  year   = {2025},
  url    = {https://huggingface.co/keeeeenw/MicroLlava}
}
```

## 📄 License

This model is released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).

You are free to use, modify, and distribute this model and its derivatives, provided that you comply with the terms of the license. If you use this model in your research or applications, please credit the original authors and clearly indicate any modifications you have made.

> **Note**: Ensure that the datasets used for pretraining or finetuning also allow redistribution of derived model weights.

---

## 🙏 Acknowledgements

This work builds upon the efforts of many in the open-source AI community:

- **[TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory)** maintainers and contributors for creating the training framework
- **[`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama)**: I am also the creator of MicroLlama. Please help support my work!
- **SigLIP2** authors for the efficient vision encoder architecture
- Contributors to **LAION-CC-SBU-558K** and other datasets used in pretraining and finetuning
- The Hugging Face ecosystem for hosting, tools, and community support