Upload folder using huggingface_hub
- README.md +29 -22
- config.json +10 -5
- conversation.py +0 -1
- modeling_intern_vit.py +7 -7
- modeling_internlm2.py +10 -10
- modeling_internvl_chat.py +4 -4
- preprocessor_config.json +1 -1
    	
README.md  CHANGED

@@ -1,55 +1,57 @@
 ---
 license: mit
 datasets:
-- laion/laion2B-en
-- laion/laion-coco
-- laion/laion2B-multi
-- kakaobrain/coyo-700m
-- conceptual_captions
-- wanng/wukong100m
+  - laion/laion2B-en
+  - laion/laion-coco
+  - laion/laion2B-multi
+  - kakaobrain/coyo-700m
+  - conceptual_captions
+  - wanng/wukong100m
 pipeline_tag: visual-question-answering
 ---
 
 # Model Card for Mini-InternVL-Chat-2B-V1-5
+
 <p align="center">
   <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/D60YzQBIzvoCvLRp2gZ0A.jpeg" alt="Image Description" width="300" height="300" />
 </p>
 
 > _Two interns holding hands, symbolizing the integration of InternViT and InternLM._
 
-\[[InternVL 1.5 Technical Report](https://arxiv.org/abs/2404.16821)\]  \[[CVPR Paper](https://arxiv.org/abs/2312.14238)\]  \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[中文解读](https://zhuanlan.zhihu.com/p/675877376)]
+\[[InternVL 1.5 Technical Report](https://arxiv.org/abs/2404.16821)\]  \[[CVPR Paper](https://arxiv.org/abs/2312.14238)\]  \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[中文解读](https://zhuanlan.zhihu.com/p/675877376)\]
 
 You can run multimodal large models using a 1080Ti now.
 
 We are delighted to introduce the Mini-InternVL-Chat series. In the era of large language models, many researchers have started to focus on smaller language models, such as Gemma-2B, Qwen-1.8B, and InternLM2-1.8B. Inspired by their efforts, we have distilled our vision foundation model [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) down to 300M and used [InternLM2-Chat-1.8B](https://huggingface.co/internlm/internlm2-chat-1_8b) or [Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct) as our language model. This resulted in a small multimodal model with excellent performance.
 
 As shown in the figure below, we adopted the same model architecture as InternVL 1.5. We simply replaced the original InternViT-6B with InternViT-300M and InternLM2-Chat-20B with InternLM2-Chat-1.8B / Phi-3-mini-128k-instruct. For training, we used the same data as InternVL 1.5 to train this smaller model. Additionally, due to the lower training costs of smaller models, we used a context length of 8K during training.
 
 
 
 ## Model Details
+
 - **Model Type:** multimodal large language model (MLLM)
+
 - **Model Stats:**
+
   - Architecture: [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px) + MLP + [InternLM2-Chat-1.8B](https://huggingface.co/internlm/internlm2-chat-1_8b)
   - Image size: dynamic resolution, max to 40 tiles of 448 x 448 (4K resolution).
   - Params: 2.2B
 
 - **Training Strategy:**
+
   - Learnable component in the pretraining stage: ViT + MLP
   - Learnable component in the finetuning stage: ViT + MLP + LLM
-  - For more details on training hyperparameters, take a look at our code: [pretrain]() | [finetune]()
-
+  - For more details on training hyperparameters, take a look at our code: [pretrain](<>) | [finetune](<>)
+
 ## Released Models
 
-| InternVL-Chat-V1.2-Plus(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus)) | InternViT-6B-448px-V1-2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) |
+| Model | Vision Foundation Model | Release Date | Note |
+| :---: | :---: | :---: | :--- |
+| InternVL-Chat-V1.5(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)) | InternViT-6B-448px-V1-5(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5)) | 2024.04.18 | support 4K image; super strong OCR; Approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc. (🔥new) |
+| InternVL-Chat-V1.2-Plus(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus)) | InternViT-6B-448px-V1-2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.21 | more SFT data and stronger |
+| InternVL-Chat-V1.2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2)) | InternViT-6B-448px-V1-2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.11 | scaling up LLM to 34B |
+| InternVL-Chat-V1.1(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)) | InternViT-6B-448px-V1-0(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0)) | 2024.01.24 | support Chinese and stronger OCR |
 
 ## Performance
 
@@ -59,7 +61,7 @@ As shown in the figure below, we adopted the same model architecture as InternVL
 
 We provide an example code to run Mini-InternVL-Chat-2B-V1.5 using `transformers`.
 
-You can also use our [online demo](https://internvl.opengvlab.com/)
+You can also use our [online demo](https://internvl.opengvlab.com/) for a quick experience of this model.
 
 > Please use transformers==4.37.2 to ensure the model works normally.
 
@@ -150,7 +152,6 @@ def load_image(image_file, input_size=448, max_num=6):
     pixel_values = torch.stack(pixel_values)
     return pixel_values
 
-
 path = "OpenGVLab/Mini-InternVL-Chat-2B-V1-5"
 model = AutoModel.from_pretrained(
     path,
@@ -222,12 +223,18 @@ If you find this project useful in your research, please consider citing:
   journal={arXiv preprint arXiv:2312.14238},
   year={2023}
 }
+@article{chen2024far,
+  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
+  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
+  journal={arXiv preprint arXiv:2404.16821},
+  year={2024}
+}
 ```
 
 ## License
 
-This project is released under the MIT license.
+This project is released under the MIT license.
 
 ## Acknowledgement
 
-InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
+InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
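The README's quickstart reduces to loading the checkpoint with `trust_remote_code` and feeding it 448x448 pixel tiles. Below is a minimal sketch of that flow; the `model.chat` entry point, the generation settings, and the single-crop preprocessing (instead of the README's multi-tile `load_image` helper, which this diff only shows in part) are assumptions noted in the comments.

```python
# Minimal usage sketch for Mini-InternVL-Chat-2B-V1-5, assuming transformers==4.37.2
# as the README requests. model.chat is the chat entry point exposed by the repo's
# custom InternVLChatModel code (assumption based on the model card).
import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/Mini-InternVL-Chat-2B-V1-5"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,      # matches the torch_dtype recorded in config.json
    low_cpu_mem_usage=True,
    trust_remote_code=True,          # needed so auto_map resolves the repo's custom classes
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# Single 448x448 crop instead of the README's tiling load_image helper;
# the ImageNet mean/std below are assumed to match that helper's constants.
transform = T.Compose([
    T.Resize((448, 448), interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
pixel_values = transform(Image.open('example.jpg').convert('RGB')).unsqueeze(0)
pixel_values = pixel_values.to(torch.bfloat16).cuda()

generation_config = dict(num_beams=1, max_new_tokens=512, do_sample=False)
response = model.chat(tokenizer, pixel_values, 'Please describe the image.', generation_config)
print(response)
```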
    	
config.json  CHANGED

@@ -6,13 +6,14 @@
   ],
   "auto_map": {
     "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
-    "AutoModel": "modeling_internvl_chat.InternVLChatModel"
+    "AutoModel": "modeling_internvl_chat.InternVLChatModel",
+    "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel"
   },
   "downsample_ratio": 0.5,
   "dynamic_image_size": true,
   "force_image_size": 448,
   "llm_config": {
-    "_name_or_path": "
+    "_name_or_path": "pretrained/internlm2-chat-1_8b",
     "add_cross_attention": false,
     "architectures": [
       "InternLM2ForCausalLM"
@@ -113,12 +114,16 @@
   "use_llm_lora": 0,
   "use_thumbnail": true,
   "vision_config": {
-    "_name_or_path": "",
+    "_name_or_path": "OpenGVLab/InternViT-300M-448px",
     "add_cross_attention": false,
     "architectures": [
       "InternVisionModel"
     ],
     "attention_dropout": 0.0,
+    "auto_map": {
+      "AutoConfig": "configuration_intern_vit.InternVisionConfig",
+      "AutoModel": "modeling_intern_vit.InternVisionModel"
+    },
     "bad_words_ids": null,
     "begin_suppress_tokens": null,
     "bos_token_id": null,
@@ -189,11 +194,11 @@
     "tokenizer_class": null,
     "top_k": 50,
     "top_p": 1.0,
-    "torch_dtype": "
+    "torch_dtype": "bfloat16",
     "torchscript": false,
     "transformers_version": "4.36.2",
     "typical_p": 1.0,
-    "use_bfloat16": 
+    "use_bfloat16": true,
     "use_flash_attn": true
   }
 }
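The point of the enlarged `auto_map` is that both `AutoModel` and the newly added `AutoModelForCausalLM` now resolve to the custom `InternVLChatModel` shipped in this repo, and the nested vision `auto_map` does the same for `InternVisionModel`. A small sketch of how that resolution looks from the user side, assuming `trust_remote_code=True`:

```python
# Sketch of what the added auto_map entries enable: with trust_remote_code=True,
# transformers instantiates the repo's own classes instead of a built-in architecture.
import torch
from transformers import AutoConfig, AutoModel

path = "OpenGVLab/Mini-InternVL-Chat-2B-V1-5"

config = AutoConfig.from_pretrained(path, trust_remote_code=True)
print(type(config).__name__)   # InternVLChatConfig, per "AutoConfig" in auto_map

# AutoModel (and, after this commit, AutoModelForCausalLM) both map to
# modeling_internvl_chat.InternVLChatModel.
model = AutoModel.from_pretrained(path, torch_dtype=torch.bfloat16, trust_remote_code=True)
print(type(model).__name__)    # InternVLChatModel
```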
    	
conversation.py  CHANGED

@@ -1258,4 +1258,3 @@ register_conv_template(
         sep2='</s>',
     )
 )
-
    	
modeling_intern_vit.py  CHANGED

@@ -26,9 +26,9 @@ try:
     except:  # v2
         from flash_attn.flash_attn_interface import \
             flash_attn_varlen_qkvpacked_func as flash_attn_unpadded_qkvpacked_func
-
+
     from flash_attn.bert_padding import pad_input, unpad_input
-
+
     has_flash_attn = True
 except:
     print('FlashAttention is not installed.')
@@ -47,12 +47,12 @@ class FlashAttention(nn.Module):
         attention_dropout: The dropout rate to apply to the attention
                            (default: 0.0)
     """
-
+
     def __init__(self, softmax_scale=None, attention_dropout=0.0, device=None, dtype=None):
         super().__init__()
         self.softmax_scale = softmax_scale
         self.dropout_p = attention_dropout
-
+
     def forward(self, qkv, key_padding_mask=None, causal=False, cu_seqlens=None,
                 max_s=None, need_weights=False):
         """Implements the multihead softmax attention.
@@ -65,7 +65,7 @@ class FlashAttention(nn.Module):
         assert not need_weights
         assert qkv.dtype in [torch.float16, torch.bfloat16]
         assert qkv.is_cuda
-
+
         if cu_seqlens is None:
             batch_size = qkv.shape[0]
             seqlen = qkv.shape[1]
@@ -97,7 +97,7 @@ class FlashAttention(nn.Module):
                 qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
                 softmax_scale=self.softmax_scale, causal=causal
             )
-
+
         return output, None
 
 
@@ -160,7 +160,7 @@ class InternVisionEmbeddings(nn.Module):
         target_dtype = pos_embed.dtype
         pos_embed = pos_embed.float().reshape(
             1, self.image_size // self.patch_size, self.image_size // self.patch_size, -1).permute(0, 3, 1, 2)
-        pos_embed = F.interpolate(pos_embed, size=(H, W), mode='bicubic', align_corners=False)
+        pos_embed = F.interpolate(pos_embed, size=(H, W), mode='bicubic', align_corners=False). \
             reshape(1, -1, H * W).permute(0, 2, 1).to(target_dtype)
         return pos_embed
 
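The last hunk repairs the line continuation in the position-embedding resize, so the bicubically interpolated grid is flattened back to `(1, H*W, C)` and cast to its original dtype. Below is a standalone sketch of that shape round trip with toy dimensions; the grid size and hidden width here are illustrative, not the model's real values.

```python
# Toy reproduction of the reshape -> interpolate -> reshape round trip performed
# by the patched line (image_size=448, patch_size=14 -> 32x32 grid, hidden dim 64).
import torch
import torch.nn.functional as F

image_size, patch_size, dim = 448, 14, 64
grid = image_size // patch_size                      # 32
pos_embed = torch.randn(1, grid * grid, dim, dtype=torch.bfloat16)

H, W = 16, 48                                        # target grid for a non-square input
target_dtype = pos_embed.dtype
pos_embed = pos_embed.float().reshape(1, grid, grid, -1).permute(0, 3, 1, 2)   # (1, C, 32, 32)
pos_embed = F.interpolate(pos_embed, size=(H, W), mode='bicubic', align_corners=False). \
    reshape(1, -1, H * W).permute(0, 2, 1).to(target_dtype)                    # (1, H*W, C)
print(pos_embed.shape)  # torch.Size([1, 768, 64])
```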
    	
modeling_internlm2.py  CHANGED

@@ -48,16 +48,13 @@ _CONFIG_FOR_DOC = 'InternLM2Config'
 
 flash_attn_func, flash_attn_varlen_func = None, None
 pad_input, index_first_axis, unpad_input = None, None, None
-
 try:
     from flash_attn import flash_attn_func as _flash_attn_func
-    from flash_attn import
-
-    from flash_attn.bert_padding import \
-        index_first_axis as _index_first_axis
+    from flash_attn import flash_attn_varlen_func as _flash_attn_varlen_func
+    from flash_attn.bert_padding import index_first_axis as _index_first_axis
     from flash_attn.bert_padding import pad_input as _pad_input
     from flash_attn.bert_padding import unpad_input as _unpad_input
-
+
     flash_attn_func, flash_attn_varlen_func = _flash_attn_func, _flash_attn_varlen_func
     pad_input, index_first_axis, unpad_input = _pad_input, _index_first_axis, _unpad_input
     has_flash_attn = True
@@ -164,7 +161,7 @@ class InternLM2RotaryEmbedding(nn.Module):
 
     def _set_cos_sin_cache(self, seq_len, device, dtype):
         self.max_seq_len_cached = seq_len
-        t = torch.arange(self.max_seq_len_cached, device=device
+        t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
 
         freqs = torch.einsum('i,j->ij', t, self.inv_freq)
         # Different from paper, but it uses a different permutation in order to obtain the same calculation
@@ -193,7 +190,7 @@ class InternLM2LinearScalingRotaryEmbedding(InternLM2RotaryEmbedding):
 
     def _set_cos_sin_cache(self, seq_len, device, dtype):
         self.max_seq_len_cached = seq_len
-        t = torch.arange(self.max_seq_len_cached, device=device
+        t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
         t = t / self.scaling_factor
 
         freqs = torch.einsum('i,j->ij', t, self.inv_freq)
@@ -223,7 +220,7 @@ class InternLM2DynamicNTKScalingRotaryEmbedding(InternLM2RotaryEmbedding):
             inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
             self.register_buffer('inv_freq', inv_freq, persistent=False)
 
-        t = torch.arange(self.max_seq_len_cached, device=device
+        t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
 
         freqs = torch.einsum('i,j->ij', t, self.inv_freq)
         # Different from paper, but it uses a different permutation in order to obtain the same calculation
@@ -810,6 +807,9 @@ class InternLM2Model(InternLM2PreTrainedModel):
         self.padding_idx = config.pad_token_id
         self.vocab_size = config.vocab_size
         self.config = config
+        if not has_flash_attn:
+            self.config.attn_implementation = 'eager'
+            print('Warning: Flash attention is not available, using eager attention instead.')
 
         self.tok_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
 
@@ -870,7 +870,7 @@ class InternLM2Model(InternLM2PreTrainedModel):
 
         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
 
-        if self.config.attn_implementation == 'flash_attention_2'
+        if self.config.attn_implementation == 'flash_attention_2':
            _import_flash_attn()
 
         # retrieve input_ids and inputs_embeds
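Two things change here: the three rotary `_set_cos_sin_cache` variants now build `t` on the target device and cast it to `inv_freq`'s dtype before the einsum, and the model falls back to eager attention (with a warning) when flash-attn is not installed. Below is a toy standalone sketch of the cache computation those edits touch; the cat/cos/sin tail is the standard LLaMA-style cache that the surrounding code builds, and the sizes are illustrative.

```python
# Toy rotary cos/sin cache mirroring the edited _set_cos_sin_cache lines.
import torch

dim, base, seq_len = 8, 10000, 6
inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))         # (dim/2,)
t = torch.arange(seq_len, device=inv_freq.device).to(dtype=inv_freq.dtype)  # the patched cast
freqs = torch.einsum('i,j->ij', t, inv_freq)                                # (seq_len, dim/2)
emb = torch.cat((freqs, freqs), dim=-1)                                     # (seq_len, dim)
cos_cached, sin_cached = emb.cos(), emb.sin()
print(cos_cached.shape, sin_cached.shape)  # torch.Size([6, 8]) torch.Size([6, 8])
```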
    	
modeling_internvl_chat.py  CHANGED

@@ -233,7 +233,7 @@ class InternVLChatModel(PreTrainedModel):
                          return_history=False, IMG_START_TOKEN='<img>', IMG_END_TOKEN='</img>',
                          IMG_CONTEXT_TOKEN='<IMG_CONTEXT>'):
         if history is not None or return_history:
-            print(
+            print('Now multi-turn chat is not supported in batch_chat.')
             raise NotImplementedError
         img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
         self.img_context_token_id = img_context_token_id
@@ -241,9 +241,9 @@ class InternVLChatModel(PreTrainedModel):
             eos_token_id = tokenizer.convert_tokens_to_ids('<|im_end|>')  # 92542, InternLM2
         else:
             eos_token_id = tokenizer.eos_token_id
-
+
         from .conversation import get_conv_template
-
+
         queries = []
         image_bs = pixel_values.shape[0]
         # print(f'dynamic ViT batch size: {image_bs}, image_counts: {image_counts}')
@@ -260,7 +260,7 @@ class InternVLChatModel(PreTrainedModel):
         input_ids = model_inputs['input_ids'].cuda()
         attention_mask = model_inputs['attention_mask'].cuda()
         generation_config['eos_token_id'] = eos_token_id
-
+
         generation_output = self.generate(
             pixel_values=pixel_values,
             input_ids=input_ids,
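`batch_chat` now prints an explicit message before raising `NotImplementedError` when a history is requested, and the surrounding whitespace is tidied. For context, here is a sketch of the generation setup those lines sit in, with the InternLM2 stop token wired into the generation config; the repo and tokenizer names are taken from this model card, and only the tokenizer side is exercised.

```python
# Sketch of the eos handling around the edited lines: InternLM2-based templates
# stop on '<|im_end|>', whose id is injected into the generation config before
# self.generate is called.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("OpenGVLab/Mini-InternVL-Chat-2B-V1-5",
                                          trust_remote_code=True)
eos_token_id = tokenizer.convert_tokens_to_ids('<|im_end|>')  # 92542 for InternLM2, per the code comment
generation_config = dict(num_beams=1, max_new_tokens=512, do_sample=False)
generation_config['eos_token_id'] = eos_token_id              # mirrors the line in batch_chat
print(generation_config)
# generation_output = model.generate(pixel_values=pixel_values, input_ids=input_ids,
#                                    attention_mask=attention_mask, **generation_config)
```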
    	
preprocessor_config.json  CHANGED

@@ -16,4 +16,4 @@
   ],
   "resample": 3,
   "size": 448
-}
+}
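For reference, `"resample": 3` in this config is the integer code PIL uses for bicubic resampling and `"size": 448` is the target resolution. A tiny standalone illustration of the equivalent resize (the input filename is a placeholder):

```python
# "resample": 3 corresponds to PIL's bicubic filter; "size": 448 is the target side length.
from PIL import Image

BICUBIC = 3  # same integer code as Image.Resampling.BICUBIC
img = Image.open('example.jpg').convert('RGB')   # placeholder input file
img = img.resize((448, 448), resample=BICUBIC)
print(img.size)  # (448, 448)
```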
