Create README.md
    	
        README.md
    ADDED
    
    | @@ -0,0 +1,168 @@ | |
| 1 | 
            +
            ---
         | 
| 2 | 
            +
            language:
         | 
| 3 | 
            +
            - en
         | 
| 4 | 
            +
            library_name: transformers
         | 
| 5 | 
            +
            tags:
         | 
| 6 | 
            +
            - pytorch
         | 
| 7 | 
            +
            - safetensors
         | 
| 8 | 
            +
            - vision-language
         | 
| 9 | 
            +
            - visual-question-answering
         | 
| 10 | 
            +
            pipeline_tag: visual-question-answering
         | 
| 11 | 
            +
            license: apache-2.0
         | 
| 12 | 
            +
            base_model:
         | 
| 13 | 
            +
            - keeeeenw/MicroLlama
         | 
| 14 | 
            +
            - google/siglip-so400m-patch14-384
         | 
| 15 | 
            +
            ---
         | 
| 16 | 
            +
             | 
| 17 | 
            +
            # MicroLLaVA (TinyLLaVA Factory based)
         | 
| 18 | 
            +
             | 
| 19 | 
            +
            A compact vision-language model that you can pretrain and finetune on a single consumer GPU.
         | 
| 20 | 
            +
             | 
| 21 | 
            +
            ## TLDR
         | 
| 22 | 
            +
             | 
| 23 | 
            +
            | Item            | Detail |
         | 
| 24 | 
            +
            |-----------------|--------|
         | 
| 25 | 
            +
            | Framework       | Transformers + PyTorch |
         | 
| 26 | 
            +
            | Checkpoint type | `safetensors` |
         | 
| 27 | 
            +
            | LLM             | [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) (about 300M parameters) |
         | 
| 28 | 
            +
            | Vision tower    | [`siglip-so400m-patch14-384`](https://huggingface.co/google/siglip-so400m-patch14-384) |
         | 
| 29 | 
            +
            | Hardware used   | Single NVIDIA RTX 4090 |
         | 
| 30 | 
            +
            | Training stack  | No DeepSpeed required |
         | 
| 31 | 
            +
            | Intended tasks  | Visual Question Answering, caption-style prompts |
         | 
| 32 | 
            +
             | 
| 33 | 
            +
            ---
         | 
| 34 | 
            +
             | 
| 35 | 
            +
            ## Introduction
         | 
| 36 | 
            +
             | 
| 37 | 
            +
            MicroLLaVA is a [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory) based model that pairs a very small language model [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) with an efficient SigLIP vision encoder.  
         | 
| 38 | 
            +
            The goal is a vision-language model that almost anyone can train and iterate on with a single consumer GPU.
         | 
| 39 | 
            +
             | 
| 40 | 
            +
            - **Language model**: [`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama) with ~300M parameters  
         | 
| 41 | 
            +
            - **Vision encoder**: [`siglip-so400m-patch14-384`](https://huggingface.co/google/siglip-so400m-patch14-384)
         | 
| 42 | 
            +
            - **Training codebase**: [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory), with additional training tweaks in [my fork](https://github.com/keeeeenw/TinyLLaVA_Factory)
         | 
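The encoder name `siglip-so400m-patch14-384` encodes a 14-pixel patch size and a 384-pixel input resolution. As a rough sketch, assuming square inputs and a non-overlapping patch embedding with no padding (the helper name `num_image_patches` is hypothetical and not part of any library), the number of patch embeddings per image can be estimated like this:

```python
def num_image_patches(image_size: int = 384, patch_size: int = 14) -> int:
    """Estimate the number of patch embeddings for a square image.

    Assumes non-overlapping patches and no padding, so leftover pixels at
    the border (384 is not a multiple of 14) are simply dropped.
    """
    if image_size <= 0 or patch_size <= 0:
        raise ValueError("image_size and patch_size must be positive")
    per_side = image_size // patch_size
    return per_side * per_side


if __name__ == "__main__":
    print(num_image_patches())  # 27 patches per side -> 729
```

How many of these embeddings reach the language model depends on the connector configuration used during training, which this README does not specify.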
| 43 | 
            +
             | 
| 44 | 
            +
            ---
         | 
| 45 | 
            +
             | 
| 46 | 
            +
            ## Files included
         | 
| 47 | 
            +
             | 
| 48 | 
            +
            | File                       | Purpose |
         | 
| 49 | 
            +
            |----------------------------|---------|
         | 
| 50 | 
            +
            | `config.json`              | Model configuration for Transformers |
         | 
| 51 | 
            +
            | `generation_config.json`   | Generation defaults |
         | 
| 52 | 
            +
            | `model.safetensors`        | Weights |
         | 
| 53 | 
            +
            | `tokenizer.model`          | SentencePiece model |
         | 
| 54 | 
            +
            | `tokenizer_config.json`    | Tokenizer configuration |
         | 
| 55 | 
            +
            | `special_tokens_map.json`  | Special token mapping |
         | 
| 56 | 
            +
            | `trainer_state.json`       | Trainer state |
         | 
| 57 | 
            +
            | `training_args.bin`        | Training arguments |
         | 
| 58 | 
            +
            | `log.txt`                  | Training log |
         | 
| 59 | 
            +
             | 
| 60 | 
            +
            If your workflow uses a custom processor, also include `preprocessor_config.json` or `processor_config.json` so `AutoProcessor.from_pretrained` works.
         | 
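To check a local copy of the checkpoint against the table above, something like the following sketch may help. The function name `check_checkpoint_dir` is hypothetical, and the script assumes the files sit directly in one directory (for example, a downloaded snapshot).

```python
from pathlib import Path

# File names copied from the table above.
EXPECTED_FILES = [
    "config.json",
    "generation_config.json",
    "model.safetensors",
    "tokenizer.model",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "trainer_state.json",
    "training_args.bin",
    "log.txt",
]

# Optional processor configs mentioned above; either one satisfies this check.
PROCESSOR_FILES = ["preprocessor_config.json", "processor_config.json"]


def check_checkpoint_dir(path):
    """Return (missing_files, has_processor_config) for a local directory."""
    root = Path(path)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    has_processor = any((root / name).is_file() for name in PROCESSOR_FILES)
    return missing, has_processor


if __name__ == "__main__":
    missing, has_processor = check_checkpoint_dir(".")
    print("missing files:", missing or "none")
    print("processor config present:", has_processor)
```

If `has_processor` is false, `AutoProcessor.from_pretrained` is likely to fail for that directory, which is the case the Quick start section guards against with its `try`/`except`.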
| 61 | 
            +
             | 
| 62 | 
            +
            Because of its compact size, this model can be trained entirely on a single NVIDIA RTX 4090 without DeepSpeed.  
         | 
| 63 | 
            +
             | 
| 64 | 
            +
            Pretraining on **LAION-CC-SBU-558K** took about **5 hours** on this setup.
         | 
| 65 | 
            +
             | 
| 66 | 
            +
            Supervised finetuning on all datasets from the TinyLLaVA Factory guide (except `ocr_vqa`) took about **12 hours** on the same GPU.
         | 
| 67 | 
            +
             | 
| 68 | 
            +
            ---
         | 
| 69 | 
            +
             | 
| 70 | 
            +
            ## Quick start
         | 
| 71 | 
            +
             | 
| 72 | 
            +
            ```python
         | 
| 73 | 
            +
            from transformers import AutoTokenizer, AutoProcessor, AutoModelForCausalLM
         | 
| 74 | 
            +
            import torch
         | 
| 75 | 
            +
             | 
| 76 | 
            +
            repo_id = "keeeeenw/MicroLlava-siglip-so400m-patch14-384-base-finetune"
         | 
| 77 | 
            +
             | 
| 78 | 
            +
            tokenizer = AutoTokenizer.from_pretrained(repo_id)
         | 
| 79 | 
            +
             | 
| 80 | 
            +
            # If processor config is available
         | 
| 81 | 
            +
            try:
         | 
| 82 | 
            +
                processor = AutoProcessor.from_pretrained(repo_id)
         | 
| 83 | 
            +
            except Exception:
         | 
| 84 | 
            +
                processor = None  # Optional if images are preprocessed manually
         | 
| 85 | 
            +
             | 
| 86 | 
            +
            model = AutoModelForCausalLM.from_pretrained(
         | 
| 87 | 
            +
                repo_id,
         | 
| 88 | 
            +
                torch_dtype=torch.float16,
         | 
| 89 | 
            +
                device_map="auto",
         | 
| 90 | 
            +
                trust_remote_code=True,  # Only needed if the repo ships custom modeling code
         | 
| 91 | 
            +
            )
         | 
| 92 | 
            +
             | 
| 93 | 
            +
            # Text-only smoke test: no image is passed here, so the reply is not grounded in a picture
            inputs = tokenizer("Describe the image in one sentence.", return_tensors="pt").to(model.device)
         | 
| 94 | 
            +
            output_ids = model.generate(**inputs, max_new_tokens=64)
         | 
| 95 | 
            +
            print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
         | 
| 96 | 
            +
            ```
         | 
| 97 | 
            +
             | 
| 98 | 
            +
            ## Evaluation
         | 
| 99 | 
            +
             | 
| 100 | 
            +
            Evaluation results will be added in the coming days. Planned tests include:
         | 
| 101 | 
            +
             | 
| 102 | 
            +
            - VQAv2-style prompts for question answering  
         | 
| 103 | 
            +
            - Additional benchmarks, to be announced
         | 
| 104 | 
            +
             | 
| 105 | 
            +
            Community contributions with benchmark results are welcome and encouraged.
         | 
| 106 | 
            +
             | 
| 107 | 
            +
            ---
         | 
| 108 | 
            +
             | 
| 109 | 
            +
            ## Intended uses and limitations
         | 
| 110 | 
            +
             | 
| 111 | 
            +
            **Intended uses**
         | 
| 112 | 
            +
            - Rapid experimentation for vision-language research on limited hardware  
         | 
| 113 | 
            +
            - Educational demonstrations for students and hobbyists  
         | 
| 114 | 
            +
            - Starting point for domain-specific finetuning  
         | 
| 115 | 
            +
             | 
| 116 | 
            +
            **Limitations**
         | 
| 117 | 
            +
            - The small LLM size and compact vision encoder may limit reasoning depth and OCR performance  
         | 
| 118 | 
            +
            - Performance can vary significantly depending on the image domain and quality  
         | 
| 119 | 
            +
            - The model includes minimal safety filtering and refusal behavior, so downstream applications should implement their own safeguards  
         | 
| 120 | 
            +
             | 
| 121 | 
            +
            > ⚠️ This model should not be used for applications that may cause harm or have significant safety, financial, legal, or medical implications without thorough human review.
         | 
| 122 | 
            +
             | 
| 123 | 
            +
            ---
         | 
| 124 | 
            +
             | 
| 125 | 
            +
            ## Reproducibility checklist
         | 
| 126 | 
            +
             | 
| 127 | 
            +
            To reproduce results and training runs:
         | 
| 128 | 
            +
             | 
| 129 | 
            +
            1. Fix all random seeds in training scripts  
         | 
| 130 | 
            +
            2. Record exact dataset versions and any filtering applied  
         | 
| 131 | 
            +
            3. Log optimizer type, learning rate schedule, precision settings, and gradient accumulation steps  
         | 
| 132 | 
            +
            4. Save the exact TinyLLaVA Factory commit or fork commit used for both pretraining and finetuning  
         | 
| 133 | 
            +
            5. Document hardware and software versions (CUDA, PyTorch, etc.)
         | 
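Items 1 and 5 of the checklist can be sketched in a few lines. This is a minimal illustration, not the code used for the released checkpoint: the helper names `set_seed` and `environment_report` are hypothetical, and NumPy and PyTorch are treated as optional so the script still runs without them.

```python
import json
import platform
import random
import sys


def set_seed(seed: int) -> None:
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


def environment_report() -> dict:
    """Collect basic software versions to store next to the training log."""
    report = {"python": sys.version.split()[0], "platform": platform.platform()}
    try:
        import torch
        report["torch"] = torch.__version__
        report["cuda"] = torch.version.cuda  # None on CPU-only builds
    except ImportError:
        report["torch"] = None
    return report


if __name__ == "__main__":
    set_seed(42)
    print(json.dumps(environment_report(), indent=2))
```

Seeding alone does not guarantee bit-identical runs on GPU, so it is worth saving the report together with the fork commit hash from item 4.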
| 134 | 
            +
             | 
| 135 | 
            +
            ---
         | 
| 136 | 
            +
             | 
| 137 | 
            +
            ## Citation
         | 
| 138 | 
            +
             | 
| 139 | 
            +
            ```bibtex
         | 
| 140 | 
            +
            @misc{wang2025microllava,
         | 
| 141 | 
            +
              title        = {MicroLLaVA: a TinyLLaVA based VLM with MicroLlama 300M for single GPU training},
         | 
| 142 | 
            +
              author       = {Zixiao Ken Wang},
         | 
| 143 | 
            +
              year         = {2025},
         | 
| 144 | 
            +
              url          = {https://huggingface.co/keeeeenw/MicroLlava-siglip-so400m-patch14-384-base-finetune}
         | 
| 145 | 
            +
            }
         | 
| 146 | 
            +
            ```
         | 
| 147 | 
            +
             | 
| 148 | 
            +
            ## License
         | 
| 149 | 
            +
             | 
| 150 | 
            +
            This model is released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).  
         | 
| 151 | 
            +
             | 
| 152 | 
            +
            You are free to use, modify, and distribute this model and its derivatives, provided that you comply with the terms of the license.  
         | 
| 153 | 
            +
            If you use this model in your research or applications, please credit the original authors and clearly indicate any modifications you have made.  
         | 
| 154 | 
            +
             | 
| 155 | 
            +
            > **Note**: Ensure that the datasets used for pretraining or finetuning also allow redistribution of derived model weights.
         | 
| 156 | 
            +
             | 
| 157 | 
            +
            ---
         | 
| 158 | 
            +
             | 
| 159 | 
            +
            ## Acknowledgements
         | 
| 160 | 
            +
             | 
| 161 | 
            +
            This work builds upon the efforts of many in the open-source AI community:
         | 
| 162 | 
            +
             | 
| 163 | 
            +
            - **[TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory)** maintainers and contributors for creating the training framework  
         | 
| 164 | 
            +
            - **[`keeeeenw/MicroLlama`](https://huggingface.co/keeeeenw/MicroLlama)**, the base language model, which I also created. Please help support my work!  
         | 
| 165 | 
            +
            - **SigLIP** authors for the efficient vision encoder architecture  
         | 
| 166 | 
            +
            - Contributors to **LAION-CC-SBU-558K** and other datasets used in pretraining and finetuning  
         | 
| 167 | 
            +
            - The Hugging Face ecosystem for hosting, tools, and community support
         | 
| 168 | 
            +
             |