---
language:
- en
library_name: transformers
tags:
- pytorch
- safetensors
- vision-language
- visual-question-answering
pipeline_tag: visual-question-answering
license: apache-2.0
base_model:
- keeeeenw/MicroLlama
- google/siglip-so400m-patch14-384
---

# MicroLLaVA (TinyLLaVA Factory based)

A compact vision-language model that you can pretrain and finetune on a single consumer GPU.

## TL;DR

| Item | Detail |
|-----------------|--------|
| Framework | Transformers + PyTorch |
| Checkpoint type | `safetensors` |
| LLM | [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) (about 300M parameters) |
| Vision tower | [`siglip-so400m-patch14-384`](https://huggingface.co/google/siglip-so400m-patch14-384) |
| Hardware used | Single NVIDIA RTX 4090 |
| Training stack | No DeepSpeed required |
| Intended tasks | Visual Question Answering, caption-style prompts |

---

## Introduction

MicroLLaVA is built with [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory) and pairs the very small language model [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) with an efficient SigLIP vision encoder. The goal is a vision-language model that almost anyone can train and iterate on with a single consumer GPU.

- **Language model**: [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) with ~300M parameters
- **Vision encoder**: [`siglip-so400m-patch14-384`](https://huggingface.co/google/siglip-so400m-patch14-384)
- **Training codebase**: [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory), with additional changes in [my fork with training tweaks](https://github.com/keeeeenw/TinyLLaVA_Factory)

---

## Files included

| File | Purpose |
|----------------------------|---------|
| `config.json` | Model configuration for Transformers |
| `generation_config.json` | Generation defaults |
| `model.safetensors` | Weights |
| `tokenizer.model` | SentencePiece model |
| `tokenizer_config.json` | Tokenizer configuration |
| `special_tokens_map.json` | Special token mapping |
| `trainer_state.json` | Trainer state |
| `training_args.bin` | Training arguments |
| `log.txt` | Training log |

Note that `AutoProcessor.from_pretrained` only works if the repository also ships a `preprocessor_config.json` or `processor_config.json`. If it does not, you can assemble the tokenizer and image processor yourself, as sketched below.
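A minimal fallback sketch: load the tokenizer from this repository and the image processor from the SigLIP vision tower listed above. Pairing `google/siglip-so400m-patch14-384` preprocessing with this checkpoint is an assumption based on the base models named in this card, not a config shipped here.

```python
from PIL import Image
from transformers import AutoImageProcessor, AutoTokenizer

repo_id = "keeeeenw/MicroLlava-siglip-so400m-patch14-384-base-finetune"

# The tokenizer files ship with this repository.
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Assumption: image preprocessing matches the base SigLIP vision tower,
# so its preprocessor config can stand in for a missing processor config.
image_processor = AutoImageProcessor.from_pretrained("google/siglip-so400m-patch14-384")

# A blank 384x384 RGB image, just to exercise the preprocessing path.
image = Image.new("RGB", (384, 384), color="white")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
print(pixel_values.shape)  # expected: [1, 3, 384, 384]
```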
---

## Training

Because of its compact size, the model can be trained end to end on a single NVIDIA RTX 4090 without DeepSpeed. Pretraining on **LAION-CC-SBU-558K** took about **5 hours**; supervised finetuning on all datasets from the TinyLLaVA Factory guide (except `ocr_vqa`) took about **12 hours** on the same GPU.

---

## Quick start

```python
from transformers import AutoTokenizer, AutoProcessor, AutoModelForCausalLM
import torch

repo_id = "keeeeenw/MicroLlava-siglip-so400m-patch14-384-base-finetune"

tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Load the processor if the repository ships a processor config;
# otherwise fall back to preprocessing images manually.
try:
    processor = AutoProcessor.from_pretrained(repo_id)
except Exception:
    processor = None

model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,  # needed if the repository includes custom modeling code
)

inputs = tokenizer("Describe the image in one sentence.", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
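As a quick sanity check on the single-GPU claim, you can count the parameters of the assembled model. The snippet reuses the `model` object loaded in the quick start above; the fp16 figure is a rough rule of thumb (2 bytes per weight), not a measured number.

```python
# Rough footprint check, reusing `model` from the quick start above.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters, roughly {n_params * 2 / 1e9:.2f} GB of weights in fp16")
```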
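The quick start above exercises text-only generation. For an image plus a question, the upstream TinyLLaVA Factory model cards document a `chat` helper that is available when a checkpoint ships the factory's custom modeling code and is loaded with `trust_remote_code=True`. The sketch below follows that upstream interface; whether this repository exposes the same helper with the same signature is an assumption, so treat it as a starting point rather than a confirmed API.

```python
# Sketch of image-based VQA. Assumption: this checkpoint exposes the TinyLLaVA
# Factory `chat` helper through its custom modeling code (not verified here).
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "keeeeenw/MicroLlava-siglip-so400m-patch14-384-base-finetune"

model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
model.cuda()
tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=False)

prompt = "What objects are in this image?"
image_url = "http://images.cocodataset.org/test-stuff2017/000000000001.jpg"

# In the upstream interface, `chat` handles image loading, preprocessing, and
# prompt templating, and returns the answer together with the generation time.
output_text, generation_time = model.chat(prompt=prompt, image=image_url, tokenizer=tokenizer)
print("model output:", output_text)
```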
## Evaluation

Evaluation results will be added in the coming days. Planned tests include:

- VQAv2-style prompts for question answering
- Additional benchmarks to follow

Community contributions with benchmark results are welcome and encouraged.

---

## Intended uses and limitations

**Intended uses**

- Rapid experimentation for vision-language research on limited hardware
- Educational demonstrations for students and hobbyists
- Starting point for domain-specific finetuning

**Limitations**

- The small LLM size and compact vision encoder may limit reasoning depth and OCR performance
- Performance can vary significantly depending on the image domain and quality
- The model includes minimal safety filtering and refusal behavior, so downstream applications should implement their own safeguards

> ⚠️ This model should not be used for applications that may cause harm or have significant safety, financial, legal, or medical implications without thorough human review.

---

## Reproducibility checklist

To reproduce results and training runs:

1. Fix all random seeds in training scripts
2. Record exact dataset versions and any filtering applied
3. Log optimizer type, learning rate schedule, precision settings, and gradient accumulation steps
4. Save the exact TinyLLaVA Factory commit or fork commit used for both pretraining and finetuning
5. Document hardware and software versions (CUDA, PyTorch, etc.)

---

## Citation

```bibtex
@misc{wang2025microllava,
  title  = {MicroLLaVA: a TinyLLaVA based VLM with MicroLlama 300M for single GPU training},
  author = {Zixiao Ken Wang},
  year   = {2025},
  url    = {https://huggingface.co/keeeeenw/MicroLlava-siglip-so400m-patch14-384-base-finetune}
}
```

## License

This model is released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0). You are free to use, modify, and distribute this model and its derivatives, provided that you comply with the terms of the license. If you use this model in your research or applications, please credit the original authors and clearly indicate any modifications you have made.

> **Note**: Ensure that the datasets used for pretraining or finetuning also allow redistribution of derived model weights.

---

## Acknowledgements

This work builds upon the efforts of many in the open-source AI community:

- **[TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory)** maintainers and contributors for creating the training framework
- **[`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama)**, which I also created; please help support my work!
- **SigLIP** authors for the efficient vision encoder architecture
- Contributors to **LAION-CC-SBU-558K** and other datasets used in pretraining and finetuning
- The Hugging Face ecosystem for hosting, tools, and community support