|
---
language:
- en
library_name: transformers
tags:
- pytorch
- safetensors
- vision-language
- visual-question-answering
pipeline_tag: visual-question-answering
license: apache-2.0
base_model:
- google/siglip2-so400m-patch14-384
- keeeeenw/MicroLlama
model-index:
- name: MicroLLaVA (MicroLlama 300M + SigLIP2-so400m-patch14-384)
  results:
  - task:
      type: visual-question-answering
      name: VQAv2
    dataset:
      name: VQAv2
      type: vqav2
    metrics:
    - name: Overall Accuracy
      type: accuracy
      value: 56.91
    - name: Yes/No Accuracy
      type: accuracy
      value: 72.32
    - name: Number Accuracy
      type: accuracy
      value: 43.89
    - name: Other Accuracy
      type: accuracy
      value: 46.65
    source:
      name: Internal Evaluation on VQAv2 test-dev
      url: https://visualqa.org/download.html
---
|
|
|
# MicroLLaVA |
|
|
|
A compact vision-language model that you can pretrain and finetune on a single consumer GPU.
|
|
|
## Performance & Training Highlights
|
|
|
- **VQAv2 Accuracy**:
  Achieves **56.91%** on the VQAv2 test-dev split, making MicroLLaVA one of the best-performing open-source vision-language models under **700M total parameters**.

- **Parameter Budget**:
  - Language model: **MicroLlama (300M)**
  - Vision encoder: **SigLIP2 so400m (400M)**
  - Total: **~700M parameters**

- **Best in Class**:
  According to ChatGPT's Deep Research Agent (Aug 2025):
  > *"No known open model below ~700M currently surpasses MicroLLaVA's VQAv2 accuracy. Models that do perform better tend to have larger language components."*

- **Ongoing Experiments**:
  - **Qwen3-0.6B + SigLIP2**: training is **converging**, showing promising loss curves (note that Qwen3-0.6B is significantly larger than MicroLlama).
  - **Gemma-3-270M-IT + SigLIP2**: training **did not converge**, likely due to instability, bugs, or poor alignment under the current hyperparameters.
|
|
|
## News and Updates
|
|
|
* 08/17/2025: This Hugging Face repo was renamed to https://huggingface.co/keeeeenw/MicroLlava.
* 08/17/2025: Improved the **VQAv2** average test-dev score from **44.01%** to **56.91%** by upgrading the vision tower from SigLIP to SigLIP2.
* 08/09/2025: Initial version of MicroLLaVA released.
|
|
|
## TL;DR
|
|
|
| Item | Detail |
|-----------------|--------|
| Framework | Transformers + PyTorch |
| Checkpoint type | `safetensors` |
| LLM | [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) (about 300M parameters) |
| Vision tower | [`google/siglip2-so400m-patch14-384`](https://huggingface.co/google/siglip2-so400m-patch14-384) |
| Hardware used | Single NVIDIA RTX 4090 |
| Training stack | No DeepSpeed required |
| Intended tasks | Visual Question Answering, caption-style prompts |
|
|
|
--- |
|
|
|
## Introduction
|
|
|
MicroLLaVA is a [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory)-based model that pairs the very small language model [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) with an efficient SigLIP2 vision encoder.
The goal is a vision-language model that almost anyone can train and iterate on with a single consumer GPU.
|
|
|
- **Language model**: [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) with ~300M parameters |
|
- **Vision encoder**: [`siglip2-so400m-patch14-384`](https://huggingface.co/google/siglip2-so400m-patch14-384) |
|
- **Training codebase**: [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory), with additional training tweaks in [my fork](https://github.com/keeeeenw/TinyLLaVA_Factory)
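
As a quick sanity check on the combined parameter budget of the two base models listed above, the sketch below loads each checkpoint from the Hub and counts its parameters. It assumes only that both public checkpoints load through the standard Auto classes with a recent version of `transformers`; it does not use the combined MicroLLaVA code.

```python
from transformers import AutoModel, AutoModelForCausalLM

def count_params_millions(model) -> float:
    # Total parameter count, reported in millions.
    return sum(p.numel() for p in model.parameters()) / 1e6

# Load the two base models separately (this downloads both checkpoints).
llm = AutoModelForCausalLM.from_pretrained("keeeeenw/MicroLlama")
vision = AutoModel.from_pretrained("google/siglip2-so400m-patch14-384")

print(f"MicroLlama:     {count_params_millions(llm):.0f}M parameters")
print(f"SigLIP2 so400m: {count_params_millions(vision):.0f}M parameters")
print(f"Combined:       {count_params_millions(llm) + count_params_millions(vision):.0f}M parameters")
```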
|
|
|
Because of its compact size, this model can be trained entirely on a single NVIDIA RTX 4090 without DeepSpeed. |
|
|
|
Pretraining on **LAION-CC-SBU-558K** took about **5 hours** on that setup.
|
|
|
Supervised finetuning on all datasets from the TinyLLaVA Factory guide (except `ocr_vqa`) took about **12 hours** on the same GPU. |
|
|
|
--- |
|
|
|
## Quick start
|
|
|
```python
from transformers import AutoTokenizer, AutoProcessor, AutoModelForCausalLM
import torch

repo_id = "keeeeenw/MicroLlava"

# Tokenizer for the MicroLlama language model.
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Load the processor if the repo ships a processor config; otherwise
# preprocess images manually before passing them to the model.
try:
    processor = AutoProcessor.from_pretrained(repo_id)
except Exception:
    processor = None

# trust_remote_code=True is needed because the repo includes custom model code.
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# Text-only prompt; see the image-based sketch below for visual question answering.
inputs = tokenizer("Describe the image in one sentence.", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
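
Since the model targets visual question answering, you will usually want to pass an image along with the question. The exact multimodal inference API depends on the custom code shipped with this repo; the snippet below is a minimal sketch that assumes the `AutoProcessor` loaded above accepts `images` and `text` in the usual Transformers convention, and the image URL is only a placeholder.

```python
from PIL import Image
import requests

# Placeholder image URL for illustration only; substitute your own image.
url = "https://example.com/cat.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

if processor is not None:
    # Assumes the processor packs pixel values and token ids together,
    # which is the common convention but has not been verified for this repo.
    inputs = processor(
        images=image,
        text="What animal is in the picture?",
        return_tensors="pt",
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```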
|
|
|
## Evaluation
|
|
|
### VQAv2 Evaluation Results (MicroLlama 300M + SigLIP2-so400m-patch14-384)
|
|
|
| Question Type | Accuracy |
|---------------|----------|
| Yes/No | 72.32% |
| Number | 43.89% |
| Other | 46.65% |
| **Overall** | **56.91%** |
|
|
|
*Evaluated on the VQAv2 test-dev split.*
|
|
|
### (Previous version) VQAv2 Evaluation Results (MicroLlama 300M + SigLIP-so400m-patch14-384)
|
|
|
| Question Type | Accuracy |
|---------------|----------|
| Yes/No | 65.08% |
| Number | 28.97% |
| Other | 29.32% |
| **Overall** | **44.01%** |
|
|
|
*Evaluated on the VQAv2 test-dev split.*
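
For context on how these numbers are computed: VQAv2 uses the standard VQA accuracy metric, which scores a predicted answer against ten human annotations, and the overall score is a question-count-weighted average across the three types rather than a simple mean of the rows above. The sketch below shows a simplified form of that metric; the official evaluation additionally averages over annotator subsets and normalizes punctuation, articles, and number words.

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Simplified VQA accuracy: full credit if at least 3 of the
    (typically 10) annotators gave the predicted answer."""
    matches = sum(1 for ans in human_answers if ans == predicted)
    return min(matches / 3.0, 1.0)

# Toy example with 10 annotator answers (illustrative only).
answers = ["2", "2", "two", "2", "3", "2", "2", "two", "2", "2"]
print(vqa_accuracy("2", answers))    # 1.0   (7 matches, capped at 1)
print(vqa_accuracy("two", answers))  # 0.667 (2 matches / 3)
```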
|
|
|
More evaluation results will be added in the coming days. |
|
|
|
Community contributions with benchmark results are welcome and encouraged. |
|
|
|
--- |
|
|
|
## Intended uses and limitations
|
|
|
**Intended uses** |
|
- Rapid experimentation for vision-language research on limited hardware |
|
- Educational demonstrations for students and hobbyists |
|
- Starting point for domain-specific finetuning |
|
|
|
**Limitations** |
|
- The small LLM size and compact vision encoder may limit reasoning depth and OCR performance |
|
- Performance can vary significantly depending on the image domain and quality |
|
- The model includes minimal safety filtering and refusal behavior; downstream applications should implement their own safeguards
|
|
|
> **Warning:** This model should not be used for applications that may cause harm or that have significant safety, financial, legal, or medical implications without thorough human review.
|
|
|
--- |
|
|
|
## Citation
|
|
|
```bibtex
@misc{wang2025microllava,
  title  = {MicroLLaVA: a TinyLLaVA-based VLM with MicroLlama 300M for single-GPU training},
  author = {Zixiao Ken Wang},
  year   = {2025},
  url    = {https://huggingface.co/keeeeenw/MicroLlava}
}
```
|
|
|
## License
|
|
|
This model is released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0). |
|
|
|
You are free to use, modify, and distribute this model and its derivatives, provided that you comply with the terms of the license. |
|
If you use this model in your research or applications, please credit the original authors and clearly indicate any modifications you have made. |
|
|
|
> **Note**: Ensure that the datasets used for pretraining or finetuning also allow redistribution of derived model weights. |
|
|
|
--- |
|
|
|
## Acknowledgements
|
|
|
This work builds upon the efforts of many in the open-source AI community: |
|
|
|
- **[TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory)** maintainers and contributors for creating the training framework |
|
- **[`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama)**, which I also created. Please help support my work!
|
- **SigLIP2** authors for the efficient vision encoder architecture |
|
- Contributors to **LAION-CC-SBU-558K** and other datasets used in pretraining and finetuning |
|
- The Hugging Face ecosystem for hosting, tools, and community support |