---
language:
- en
library_name: transformers
tags:
- pytorch
- safetensors
- vision-language
- visual-question-answering
pipeline_tag: visual-question-answering
license: apache-2.0
base_model:
- keeeeenw/MicroLlama
- google/siglip-so400m-patch14-384
---

# MicroLLaVA (TinyLLaVA Factory based)

A compact vision-language model that you can pretrain and finetune on a single consumer GPU.

## TL;DR

| Item | Detail |
|-----------------|--------|
| Framework | Transformers + PyTorch |
| Checkpoint type | `safetensors` |
| LLM | [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) (about 300M parameters) |
| Vision tower | [`siglip-so400m-patch14-384`](https://huggingface.co/google/siglip-so400m-patch14-384) |
| Hardware used | Single NVIDIA RTX 4090 |
| Training stack | No DeepSpeed required |
| Intended tasks | Visual Question Answering, caption-style prompts |

---

## Introduction

MicroLLaVA is built with [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory) and pairs the very small language model [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) with an efficient SigLIP vision encoder. The goal is a vision-language model that almost anyone can train and iterate on with a single consumer GPU.

- **Language model**: [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) with ~300M parameters
- **Vision encoder**: [`siglip-so400m-patch14-384`](https://huggingface.co/google/siglip-so400m-patch14-384)
- **Training codebase**: [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory), with additional changes in [my fork with training tweaks](https://github.com/keeeeenw/TinyLLaVA_Factory)

---

## Files included

| File | Purpose |
|----------------------------|---------|
| `config.json` | Model configuration for Transformers |
| `generation_config.json` | Generation defaults |
| `model.safetensors` | Weights |
| `tokenizer.model` | SentencePiece model |
| `tokenizer_config.json` | Tokenizer configuration |
| `special_tokens_map.json` | Special token mapping |
| `trainer_state.json` | Trainer state |
| `training_args.bin` | Training arguments |
| `log.txt` | Training log |

Note that `AutoProcessor.from_pretrained` only works if the repository also ships a `preprocessor_config.json` or `processor_config.json`. If it does not, you can assemble the tokenizer and image processor yourself, as sketched below.
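A minimal fallback sketch: load the tokenizer from this repository and the image processor from the SigLIP vision tower listed above. Pairing `google/siglip-so400m-patch14-384` preprocessing with this checkpoint is an assumption based on the base models named in this card, not a config shipped here.

```python
from PIL import Image
from transformers import AutoImageProcessor, AutoTokenizer

repo_id = "keeeeenw/MicroLlava-siglip-so400m-patch14-384-base-finetune"

# The tokenizer files ship with this repository.
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Assumption: image preprocessing matches the base SigLIP vision tower,
# so its preprocessor config can stand in for a missing processor config.
image_processor = AutoImageProcessor.from_pretrained("google/siglip-so400m-patch14-384")

# A blank 384x384 RGB image, just to exercise the preprocessing path.
image = Image.new("RGB", (384, 384), color="white")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
print(pixel_values.shape)  # expected: [1, 3, 384, 384]
```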
---

## Training

Because of its compact size, the model can be trained end to end on a single NVIDIA RTX 4090 without DeepSpeed. Pretraining on **LAION-CC-SBU-558K** took about **5 hours**; supervised finetuning on all datasets from the TinyLLaVA Factory guide (except `ocr_vqa`) took about **12 hours** on the same GPU.

---

## Quick start

```python
from transformers import AutoTokenizer, AutoProcessor, AutoModelForCausalLM
import torch

repo_id = "keeeeenw/MicroLlava-siglip-so400m-patch14-384-base-finetune"

tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Load the processor if the repository ships a processor config;
# otherwise fall back to preprocessing images manually.
try:
    processor = AutoProcessor.from_pretrained(repo_id)
except Exception:
    processor = None

model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,  # needed if the repository includes custom modeling code
)

inputs = tokenizer("Describe the image in one sentence.", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
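As a quick sanity check on the single-GPU claim, you can count the parameters of the assembled model. The snippet reuses the `model` object loaded in the quick start above; the fp16 figure is a rough rule of thumb (2 bytes per weight), not a measured number.

```python
# Rough footprint check, reusing `model` from the quick start above.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters, roughly {n_params * 2 / 1e9:.2f} GB of weights in fp16")
```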
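The quick start above exercises text-only generation. For an image plus a question, the upstream TinyLLaVA Factory model cards document a `chat` helper that is available when a checkpoint ships the factory's custom modeling code and is loaded with `trust_remote_code=True`. The sketch below follows that upstream interface; whether this repository exposes the same helper with the same signature is an assumption, so treat it as a starting point rather than a confirmed API.

```python
# Sketch of image-based VQA. Assumption: this checkpoint exposes the TinyLLaVA
# Factory `chat` helper through its custom modeling code (not verified here).
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "keeeeenw/MicroLlava-siglip-so400m-patch14-384-base-finetune"

model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
model.cuda()
tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=False)

prompt = "What objects are in this image?"
image_url = "http://images.cocodataset.org/test-stuff2017/000000000001.jpg"

# In the upstream interface, `chat` handles image loading, preprocessing, and
# prompt templating, and returns the answer together with the generation time.
output_text, generation_time = model.chat(prompt=prompt, image=image_url, tokenizer=tokenizer)
print("model output:", output_text)
```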
## Evaluation

Evaluation results will be added in the coming days. Planned tests include:

- VQAv2-style prompts for question answering
- Additional benchmarks to follow

Community contributions with benchmark results are welcome and encouraged.

---

## Intended uses and limitations

**Intended uses**

- Rapid experimentation for vision-language research on limited hardware
- Educational demonstrations for students and hobbyists
- Starting point for domain-specific finetuning

**Limitations**

- The small LLM size and compact vision encoder may limit reasoning depth and OCR performance
- Performance can vary significantly depending on the image domain and quality
- The model includes minimal safety filtering and refusal behavior, so downstream applications should implement their own safeguards

> ⚠️ This model should not be used for applications that may cause harm or have significant safety, financial, legal, or medical implications without thorough human review.

---

## Reproducibility checklist

To reproduce results and training runs:

1. Fix all random seeds in training scripts
2. Record exact dataset versions and any filtering applied
3. Log optimizer type, learning rate schedule, precision settings, and gradient accumulation steps
4. Save the exact TinyLLaVA Factory commit or fork commit used for both pretraining and finetuning
5. Document hardware and software versions (CUDA, PyTorch, etc.)

---

## Citation

```bibtex
@misc{wang2025microllava,
  title  = {MicroLLaVA: a TinyLLaVA based VLM with MicroLlama 300M for single GPU training},
  author = {Zixiao Ken Wang},
  year   = {2025},
  url    = {https://huggingface.co/keeeeenw/MicroLlava-siglip-so400m-patch14-384-base-finetune}
}
```

## License

This model is released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0). You are free to use, modify, and distribute this model and its derivatives, provided that you comply with the terms of the license. If you use this model in your research or applications, please credit the original authors and clearly indicate any modifications you have made.

> **Note**: Ensure that the datasets used for pretraining or finetuning also allow redistribution of derived model weights.

---

## Acknowledgements

This work builds upon the efforts of many in the open-source AI community:

- **[TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory)** maintainers and contributors for creating the training framework
- **[`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama)**, which I also created; please help support my work!
- **SigLIP** authors for the efficient vision encoder architecture
- Contributors to **LAION-CC-SBU-558K** and other datasets used in pretraining and finetuning
- The Hugging Face ecosystem for hosting, tools, and community support