---
language:
- en
library_name: transformers
tags:
- pytorch
- safetensors
- vision-language
- visual-question-answering
pipeline_tag: visual-question-answering
license: apache-2.0
base_model:
- keeeeenw/MicroLlama
- google/siglip-so400m-patch14-384
---

# MicroLLaVA (TinyLLaVA Factory based)

A compact vision-language model that you can pretrain and finetune on a single consumer GPU.

## TLDR

| Item | Detail |
|-----------------|--------|
| Framework | Transformers + PyTorch |
| Checkpoint type | `safetensors` |
| LLM | [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) (about 300M parameters) |
| Vision tower | [`siglip-so400m-patch14-384`](https://huggingface.co/google/siglip-so400m-patch14-384) |
| Hardware used | Single NVIDIA RTX 4090 |
| Training stack | No DeepSpeed required |
| Intended tasks | Visual Question Answering, caption-style prompts |

---

## Introduction

MicroLLaVA is a [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory)-based model that pairs the very small language model [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) with an efficient SigLIP vision encoder.
The goal is a vision-language model that almost anyone can train and iterate on with one consumer GPU.

- **Language model**: [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) with ~300M parameters
- **Vision encoder**: [`siglip-so400m-patch14-384`](https://huggingface.co/google/siglip-so400m-patch14-384)
- **Training codebase**: [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory), plus additional training tweaks in [my fork](https://github.com/keeeeenw/TinyLLaVA_Factory)

---

## Files included

| File | Purpose |
|----------------------------|---------|
| `config.json` | Model configuration for Transformers |
| `generation_config.json` | Generation defaults |
| `model.safetensors` | Weights |
| `tokenizer.model` | SentencePiece model |
| `tokenizer_config.json` | Tokenizer configuration |
| `special_tokens_map.json` | Special token mapping |
| `trainer_state.json` | Trainer state |
| `training_args.bin` | Training arguments |
| `log.txt` | Training log |

If your workflow uses a custom processor, also include `preprocessor_config.json` or `processor_config.json` so `AutoProcessor.from_pretrained` works.
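
As a quick sanity check, you can list what the repository actually ships and compare it against the table above. The snippet below is a minimal sketch using `huggingface_hub.list_repo_files`; the expected set mirrors the table and may change as the repository is updated.

```python
# Minimal sketch: check which of the files listed above are present in the repo.
# Assumes the huggingface_hub package is installed.
from huggingface_hub import list_repo_files

repo_id = "keeeeenw/MicroLlava-siglip-so400m-patch14-384-base-finetune"
expected = {
    "config.json", "generation_config.json", "model.safetensors",
    "tokenizer.model", "tokenizer_config.json", "special_tokens_map.json",
}

present = set(list_repo_files(repo_id))
missing = expected - present
print("missing files:", missing or "none")
# If preprocessor_config.json / processor_config.json are absent,
# AutoProcessor.from_pretrained may fail and images must be preprocessed manually.
```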

Because of its compact size, this model can be trained entirely on a single NVIDIA RTX 4090 without DeepSpeed.

Pretraining on **LAION-CC-SBU-558K** took about **5 hours** on that GPU.

Supervised finetuning on all datasets from the TinyLLaVA Factory guide (except `ocr_vqa`) took about **12 hours** on the same GPU.

---

## Quick start

```python
from transformers import AutoTokenizer, AutoProcessor, AutoModelForCausalLM
import torch

repo_id = "keeeeenw/MicroLlava-siglip-so400m-patch14-384-base-finetune"

tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Load the processor if a processor config is available
try:
    processor = AutoProcessor.from_pretrained(repo_id)
except Exception:
    processor = None  # Optional if images are preprocessed manually

model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True  # Set to True if the repo includes custom code
)

# Text-only smoke test; see the image-based example below
inputs = tokenizer("Describe the image in one sentence.", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
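
If the processor loads, the same checkpoint can be asked about an actual image. The snippet below is a sketch only: it assumes a LLaVA-style `processor(text=..., images=...)` interface, and both `example.jpg` and the `<image>` prompt template are placeholders that may need to be adapted to this repository's custom code.

```python
# Sketch: image-grounded VQA, assuming a LLaVA-style processor interface.
# "example.jpg" and the prompt template are placeholders.
from PIL import Image

if processor is not None:
    image = Image.open("example.jpg").convert("RGB")
    prompt = "USER: <image>\nWhat objects are visible in this picture? ASSISTANT:"
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=64)
    print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```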

## Evaluation

Evaluation results will be added in the coming days. Planned tests include:

- VQAv2-style prompts for question answering (a minimal scoring sketch follows below)
- Additional benchmarks as they become available

Community contributions with benchmark results are welcome and encouraged.
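
If you want to compute a VQAv2-style number yourself, the standard VQAv2 rule scores each prediction as `min(#matching annotator answers / 3, 1)`, averaged over questions. The sketch below applies that rule to toy data; dataset loading and the official answer normalization (articles, punctuation, number words) are omitted.

```python
# Sketch: VQAv2-style accuracy on toy data.
# Real evaluations also need the official answer normalization.
def vqa_accuracy(prediction: str, annotator_answers: list[str]) -> float:
    matches = sum(1 for a in annotator_answers if a.strip().lower() == prediction.strip().lower())
    return min(matches / 3.0, 1.0)

# Hypothetical (model prediction, ten annotator answers) pairs
examples = [
    ("2", ["2", "2", "two", "2", "2", "3", "2", "2", "2", "2"]),
    ("red", ["red", "dark red", "red", "maroon", "red", "red", "red", "red", "red", "red"]),
]

scores = [vqa_accuracy(pred, answers) for pred, answers in examples]
print(f"VQA accuracy: {sum(scores) / len(scores):.3f}")
```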

---

## Intended uses and limitations

**Intended uses**
- Rapid experimentation for vision-language research on limited hardware
- Educational demonstrations for students and hobbyists
- Starting point for domain-specific finetuning

**Limitations**
- The small LLM and compact vision encoder may limit reasoning depth and OCR performance
- Performance can vary significantly depending on the image domain and quality
- The model includes minimal safety filtering and refusal behavior; downstream applications should implement their own safeguards

> ⚠️ This model should not be used for applications that may cause harm or have significant safety, financial, legal, or medical implications without thorough human review.

---

## Reproducibility checklist

To reproduce results and training runs (a seed-fixing sketch follows this list):

1. Fix all random seeds in training scripts
2. Record exact dataset versions and any filtering applied
3. Log optimizer type, learning rate schedule, precision settings, and gradient accumulation steps
4. Save the exact TinyLLaVA Factory commit or fork commit used for both pretraining and finetuning
5. Document hardware and software versions (CUDA, PyTorch, etc.)
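
For item 1, a helper along these lines can be dropped into the training scripts. This is a sketch; full determinism may additionally require `torch.use_deterministic_algorithms(True)` and seeding of DataLoader workers.

```python
# Sketch: fix the common sources of randomness before training.
import random

import numpy as np
import torch
from transformers import set_seed

def seed_everything(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    set_seed(seed)  # keeps Transformers' own RNG handling in sync

seed_everything(42)
```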

---

## Citation

```bibtex
@misc{wang2025microllava,
  title  = {MicroLLaVA: a TinyLLaVA based VLM with MicroLlama 300M for single GPU training},
  author = {Zixiao Ken Wang},
  year   = {2025},
  url    = {https://huggingface.co/keeeeenw/MicroLlava-siglip-so400m-patch14-384-base-finetune}
}
```

## License

This model is released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).

You are free to use, modify, and distribute this model and its derivatives, provided that you comply with the terms of the license.
If you use this model in your research or applications, please credit the original authors and clearly indicate any modifications you have made.

> **Note**: Ensure that the datasets used for pretraining or finetuning also allow redistribution of derived model weights.

---

## Acknowledgements

This work builds upon the efforts of many in the open-source AI community:

- **[TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory)** maintainers and contributors for creating the training framework
- **[`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama)**, which I also created; please consider supporting that work as well
- **SigLIP** authors for the efficient vision encoder architecture
- Contributors to **LAION-CC-SBU-558K** and other datasets used in pretraining and finetuning
- The Hugging Face ecosystem for hosting, tools, and community support