|
---
language:
- en
library_name: transformers
tags:
- pytorch
- safetensors
- vision-language
- visual-question-answering
pipeline_tag: visual-question-answering
license: apache-2.0
base_model:
- google/siglip2-so400m-patch14-384
- keeeeenw/MicroLlama
model-index:
- name: MicroLLaVA (MicroLlama 300M + SigLIP2-so400m-patch14-384)
  results:
  - task:
      type: visual-question-answering
      name: VQAv2
    dataset:
      name: VQAv2
      type: vqav2
    metrics:
    - name: Overall Accuracy
      type: accuracy
      value: 56.91
    - name: Yes/No Accuracy
      type: accuracy
      value: 72.32
    - name: Number Accuracy
      type: accuracy
      value: 43.89
    - name: Other Accuracy
      type: accuracy
      value: 46.65
    source:
      name: Internal Evaluation on VQAv2 test-dev
      url: https://visualqa.org/download.html
---
|
|
|
# MicroLLaVA |
|
|
|
A compact vision-language model that you can pretrain and finetune on a single consumer GPU.
|
|
|
## Performance & Training Highlights
|
|
|
- **VQAv2 Accuracy**:
  Achieves **56.91%** on the VQAv2 test-dev split, making MicroLLaVA one of the best-performing open-source vision-language models under **700M total parameters**.

- **Parameter Budget**:
  - Language model: **MicroLlama (300M)**
  - Vision encoder: **SigLIP2 so400m (400M)**
  - Total: **~700M parameters**

- **Best in Class**:
  According to ChatGPT's Deep Research Agent (Aug 2025):
  > *"No known open model below ~700M currently surpasses MicroLLaVA's VQAv2 accuracy. Models that do perform better tend to have larger language components."*

- **Ongoing Experiments**:
  - **Qwen3-0.6B + SigLIP2**: training is **converging**, showing promising loss curves (note that Qwen3-0.6B is significantly larger than MicroLlama).
  - **Gemma-3-270M-IT + SigLIP2**: training **did not converge**, likely due to instability, bugs, or poor alignment under the current hyperparameters.
|
|
|
## News and Updates
|
|
|
* 08/17/2025: This Hugging Face repo was renamed to https://huggingface.co/keeeeenw/MicroLlava.
* 08/17/2025: Improved the **VQAv2** average test-dev score from **44.01%** to **56.91%** by upgrading the vision tower from SigLIP to SigLIP2.
* 08/09/2025: Initial version of MicroLLaVA released.
|
|
|
## TL;DR
|
|
|
| Item | Detail |
|-----------------|--------|
| Framework | Transformers + PyTorch |
| Checkpoint type | `safetensors` |
| LLM | [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) (about 300M parameters) |
| Vision tower | [`google/siglip2-so400m-patch14-384`](https://huggingface.co/google/siglip2-so400m-patch14-384) |
| Hardware used | Single NVIDIA RTX 4090 |
| Training stack | No DeepSpeed required |
| Intended tasks | Visual Question Answering, caption-style prompts |
|
|
|
--- |
|
|
|
## Introduction
|
|
|
MicroLLaVA is a [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory)-based model that pairs the very small language model [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) with an efficient SigLIP2 vision encoder.
The goal is a vision-language model that almost anyone can train and iterate on with a single consumer GPU.
|
|
|
- **Language model**: [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) with ~300M parameters |
|
- **Vision encoder**: [`siglip2-so400m-patch14-384`](https://huggingface.co/google/siglip2-so400m-patch14-384) |
|
- **Training codebase**: [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory), with additional training tweaks in [my fork](https://github.com/keeeeenw/TinyLLaVA_Factory)
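
As a quick sanity check on the combined parameter budget of the two base models listed above, the sketch below loads each checkpoint from the Hub and counts its parameters. It assumes only that both public checkpoints load through the standard Auto classes with a recent version of `transformers`; it does not use the combined MicroLLaVA code.

```python
from transformers import AutoModel, AutoModelForCausalLM

def count_params_millions(model) -> float:
    # Total parameter count, reported in millions.
    return sum(p.numel() for p in model.parameters()) / 1e6

# Load the two base models separately (this downloads both checkpoints).
llm = AutoModelForCausalLM.from_pretrained("keeeeenw/MicroLlama")
vision = AutoModel.from_pretrained("google/siglip2-so400m-patch14-384")

print(f"MicroLlama:     {count_params_millions(llm):.0f}M parameters")
print(f"SigLIP2 so400m: {count_params_millions(vision):.0f}M parameters")
print(f"Combined:       {count_params_millions(llm) + count_params_millions(vision):.0f}M parameters")
```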
|
|
|
Because of its compact size, this model can be trained entirely on a single NVIDIA RTX 4090 without DeepSpeed. |
|
|
|
Pretraining on **LAION-CC-SBU-558K** took about **5 hours** on that setup.
|
|
|
Supervised finetuning on all datasets from the TinyLLaVA Factory guide (except `ocr_vqa`) took about **12 hours** on the same GPU. |
|
|
|
--- |
|
|
|
## Quick start
|
|
|
```python
from transformers import AutoTokenizer, AutoProcessor, AutoModelForCausalLM
import torch

repo_id = "keeeeenw/MicroLlava"

# Tokenizer for the MicroLlama language model.
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Load the processor if the repo ships a processor config; otherwise
# preprocess images manually before passing them to the model.
try:
    processor = AutoProcessor.from_pretrained(repo_id)
except Exception:
    processor = None

# trust_remote_code=True is needed because the repo includes custom model code.
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# Text-only prompt; see the image-based sketch below for visual question answering.
inputs = tokenizer("Describe the image in one sentence.", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
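
Since the model targets visual question answering, you will usually want to pass an image along with the question. The exact multimodal inference API depends on the custom code shipped with this repo; the snippet below is a minimal sketch that assumes the `AutoProcessor` loaded above accepts `images` and `text` in the usual Transformers convention, and the image URL is only a placeholder.

```python
from PIL import Image
import requests

# Placeholder image URL for illustration only; substitute your own image.
url = "https://example.com/cat.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

if processor is not None:
    # Assumes the processor packs pixel values and token ids together,
    # which is the common convention but has not been verified for this repo.
    inputs = processor(
        images=image,
        text="What animal is in the picture?",
        return_tensors="pt",
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```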
|
|
|
## Evaluation
|
|
|
### VQAv2 Evaluation Results (MicroLlama 300M + SigLIP2-so400m-patch14-384)
|
|
|
| Question Type | Accuracy |
|---------------|----------|
| Yes/No | 72.32% |
| Number | 43.89% |
| Other | 46.65% |
| **Overall** | **56.91%** |
|
|
|
*Evaluated on the VQAv2 test-dev split.*
|
|
|
### (Previous version) VQAv2 Evaluation Results (MicroLlama 300M + SigLIP-so400m-patch14-384)
|
|
|
| Question Type | Accuracy |
|---------------|----------|
| Yes/No | 65.08% |
| Number | 28.97% |
| Other | 29.32% |
| **Overall** | **44.01%** |
|
|
|
*Evaluated on the VQAv2 test-dev split.*
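
For context on how these numbers are computed: VQAv2 uses the standard VQA accuracy metric, which scores a predicted answer against ten human annotations, and the overall score is a question-count-weighted average across the three types rather than a simple mean of the rows above. The sketch below shows a simplified form of that metric; the official evaluation additionally averages over annotator subsets and normalizes punctuation, articles, and number words.

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Simplified VQA accuracy: full credit if at least 3 of the
    (typically 10) annotators gave the predicted answer."""
    matches = sum(1 for ans in human_answers if ans == predicted)
    return min(matches / 3.0, 1.0)

# Toy example with 10 annotator answers (illustrative only).
answers = ["2", "2", "two", "2", "3", "2", "2", "two", "2", "2"]
print(vqa_accuracy("2", answers))    # 1.0   (7 matches, capped at 1)
print(vqa_accuracy("two", answers))  # 0.667 (2 matches / 3)
```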
|
|
|
More evaluation results will be added in the coming days. |
|
|
|
Community contributions with benchmark results are welcome and encouraged. |
|
|
|
--- |
|
|
|
## Intended uses and limitations
|
|
|
**Intended uses** |
|
- Rapid experimentation for vision-language research on limited hardware |
|
- Educational demonstrations for students and hobbyists |
|
- Starting point for domain-specific finetuning |
|
|
|
**Limitations** |
|
- The small LLM size and compact vision encoder may limit reasoning depth and OCR performance |
|
- Performance can vary significantly depending on the image domain and quality |
|
- The model includes minimal safety filtering and refusal behavior; downstream applications should implement their own safeguards
|
|
|
> **Warning:** This model should not be used for applications that may cause harm or that have significant safety, financial, legal, or medical implications without thorough human review.
|
|
|
--- |
|
|
|
## Citation
|
|
|
```bibtex
@misc{wang2025microllava,
  title  = {MicroLLaVA: a TinyLLaVA-based VLM with MicroLlama 300M for single-GPU training},
  author = {Zixiao Ken Wang},
  year   = {2025},
  url    = {https://huggingface.co/keeeeenw/MicroLlava}
}
```
|
|
|
## License
|
|
|
This model is released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0). |
|
|
|
You are free to use, modify, and distribute this model and its derivatives, provided that you comply with the terms of the license. |
|
If you use this model in your research or applications, please credit the original authors and clearly indicate any modifications you have made. |
|
|
|
> **Note**: Ensure that the datasets used for pretraining or finetuning also allow redistribution of derived model weights. |
|
|
|
--- |
|
|
|
## Acknowledgements
|
|
|
This work builds upon the efforts of many in the open-source AI community: |
|
|
|
- **[TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory)** maintainers and contributors for creating the training framework |
|
- **[`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama)**, which I also created. Please help support my work!
|
- **SigLIP2** authors for the efficient vision encoder architecture |
|
- Contributors to **LAION-CC-SBU-558K** and other datasets used in pretraining and finetuning |
|
- The Hugging Face ecosystem for hosting, tools, and community support |