Commit 22734d5 · Parent(s): f8621f1

Update README.md

README.md CHANGED

@@ -1,139 +1,13 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-## Examples
-  |   |   |
-:-------------------------:|:-------------------------:
- |
-  |
-
-
-
-
-
-## Abstract
-The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. We believe the primary reason for GPT-4's advanced multi-modal generation capabilities lies in the utilization of a more advanced large language model (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen LLM, Vicuna, using just one projection layer.
-Our findings reveal that MiniGPT-4 possesses many capabilities similar to those exhibited by GPT-4, like detailed image description generation and website creation from hand-written drafts. Furthermore, we also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images, providing solutions to problems shown in images, teaching users how to cook based on food photos, etc.
-These advanced capabilities can be attributed to the use of a more advanced large language model.
-Furthermore, our method is computationally efficient, as we only train a projection layer using roughly 5 million aligned image-text pairs and an additional 3,500 carefully curated high-quality pairs.
-
-
-
-
-
-
-
-
-## Getting Started
-### Installation
-
-1. Prepare the code and the environment
-
-Git clone our repository, create a Python environment, and activate it via the following commands:
-
-```bash
-git clone https://github.com/Vision-CAIR/MiniGPT-4.git
-cd MiniGPT-4
-conda env create -f environment.yml
-conda activate minigpt4
-```
-
-
-2. Prepare the pretrained Vicuna weights
-
-The current version of MiniGPT-4 is built on the v0 version of Vicuna-13B.
-Please refer to their instructions [here](https://huggingface.co/lmsys/vicuna-13b-delta-v0) to obtain the weights.
-The final weights should be in a single folder with the following structure:
-
-```
-vicuna_weights
-├── config.json
-├── generation_config.json
-├── pytorch_model.bin.index.json
-├── pytorch_model-00001-of-00003.bin
-...
-```
-
-Then, set the path to the vicuna weight in the model config file
-[here](minigpt4/configs/models/minigpt4.yaml#L21) at Line 21.
-
-3. Prepare the pretrained MiniGPT-4 checkpoint
-
-To play with our pretrained model, download the pretrained checkpoint
-[here](https://drive.google.com/file/d/1a4zLvaiDBr-36pasffmgpvH5P7CKmpze/view?usp=share_link).
-Then, set the path to the pretrained checkpoint in the evaluation config file
-in [eval_configs/minigpt4.yaml](eval_configs/minigpt4.yaml#L15) at Line 15.
-
-
-
-
-
-### Launching Demo Locally
-
-Try out our demo [demo.py](app.py) with your images on your local machine by running
-
-```
-python demo.py --cfg-path eval_configs/minigpt4.yaml
-```
-
-
-
-
-
-### Training
-The training of MiniGPT-4 consists of two alignment stages.
-In the first stage, the model is trained using image-text pairs from the Laion and CC datasets
-to align the vision and language model. To download and prepare the datasets, please check
-[here](dataset/readme.md).
-After the first stage, the visual features are mapped and can be understood by the language
-model.
-To launch the first stage training, run
-
-```bash
-torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_config/minigpt4_stage1_laion.yaml
-```
-
-In the second stage, we use a small, high-quality image-text pair dataset created by ourselves
-and convert it to a conversation format to further align MiniGPT-4.
-Our second stage dataset can be downloaded from
-[here](https://drive.google.com/file/d/1RnS0mQJj8YU0E--sfH08scu5-ALxzLNj/view?usp=share_link).
-After the second stage alignment, MiniGPT-4 is able to talk about the image in
-a smooth way.
-To launch the second stage alignment, run
-
-```bash
-torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_config/minigpt4_stage2_align.yaml
-```
-
-
-
-
-
-## Acknowledgement
-
-+ [BLIP2](https://huggingface.co/docs/transformers/main/model_doc/blip-2)
-+ [Vicuna](https://github.com/lm-sys/FastChat)
-
-
-If you're using MiniGPT-4 in your research or applications, please cite using this BibTeX:
-```bibtex
-@misc{zhu2022minigpt4,
-      title={MiniGPT-4: Enhancing the Vision-language Understanding with Advanced Large Language Models},
-      author={Deyao Zhu and Jun Chen and Xiaoqian Shen and Xiang Li and Mohamed Elhoseiny},
-      year={2023},
-}
-```
-
-## License
-This repository is built on [Lavis](https://github.com/salesforce/LAVIS) with the BSD 3-Clause License.
-[BSD 3-Clause License](LICENSE.txt)
+---
+title: MiniGPT
+emoji: π
+colorFrom: purple
+colorTo: gray
+sdk: gradio
+sdk_version: 3.17.0
+app_file: app.py
+pinned: false
+license: other
+---
+
+Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
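
The new front matter turns this repository into a Hugging Face Space: `sdk: gradio` and `app_file: app.py` tell the Spaces runtime to execute `app.py` and serve the Gradio interface it launches. As a rough sketch only (this commit does not show the Space's actual `app.py`, and the `describe_image` function below is a hypothetical placeholder rather than MiniGPT-4's real inference code), such an entry point typically looks like:

```python
# Minimal sketch of a Gradio entry point for a Space declaring `sdk: gradio`
# and `app_file: app.py`. Illustration only: `describe_image` is a placeholder
# and does NOT load or run the actual MiniGPT-4 model.
import gradio as gr


def describe_image(image):
    # In the real Space, this would run the MiniGPT-4 pipeline on `image`.
    return "MiniGPT-4 would generate a description of this image."


demo = gr.Interface(
    fn=describe_image,
    inputs=gr.Image(type="pil"),  # uploaded image passed in as a PIL object
    outputs="text",
    title="MiniGPT",
)

if __name__ == "__main__":
    demo.launch()  # Spaces starts the app by running this file
```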