Upload folder using huggingface_hub
- README.md +29 -22
- config.json +10 -5
- conversation.py +0 -1
- modeling_intern_vit.py +7 -7
- modeling_internlm2.py +10 -10
- modeling_internvl_chat.py +4 -4
- preprocessor_config.json +1 -1
    	
README.md  CHANGED

@@ -1,55 +1,57 @@
 ---
 license: mit
 datasets:
-- laion/laion2B-en
-- laion/laion-coco
-- laion/laion2B-multi
-- kakaobrain/coyo-700m
-- conceptual_captions
-- wanng/wukong100m
+  - laion/laion2B-en
+  - laion/laion-coco
+  - laion/laion2B-multi
+  - kakaobrain/coyo-700m
+  - conceptual_captions
+  - wanng/wukong100m
 pipeline_tag: visual-question-answering
 ---
 
 # Model Card for Mini-InternVL-Chat-2B-V1-5
+
 <p align="center">
   <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/D60YzQBIzvoCvLRp2gZ0A.jpeg" alt="Image Description" width="300" height="300" />
 </p>
 
 > _Two interns holding hands, symbolizing the integration of InternViT and InternLM._
 
-\[[InternVL 1.5 Technical Report](https://arxiv.org/abs/2404.16821)\]  \[[CVPR Paper](https://arxiv.org/abs/2312.14238)\]  \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[中文解读](https://zhuanlan.zhihu.com/p/675877376)]
+\[[InternVL 1.5 Technical Report](https://arxiv.org/abs/2404.16821)\]  \[[CVPR Paper](https://arxiv.org/abs/2312.14238)\]  \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[中文解读](https://zhuanlan.zhihu.com/p/675877376)\]
 
 You can run multimodal large models using a 1080Ti now.
 
 We are delighted to introduce the Mini-InternVL-Chat series. In the era of large language models, many researchers have started to focus on smaller language models, such as Gemma-2B, Qwen-1.8B, and InternLM2-1.8B. Inspired by their efforts, we have distilled our vision foundation model [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) down to 300M and used [InternLM2-Chat-1.8B](https://huggingface.co/internlm/internlm2-chat-1_8b) or [Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct) as our language model. This resulted in a small multimodal model with excellent performance.
 
 As shown in the figure below, we adopted the same model architecture as InternVL 1.5. We simply replaced the original InternViT-6B with InternViT-300M and InternLM2-Chat-20B with InternLM2-Chat-1.8B / Phi-3-mini-128k-instruct. For training, we used the same data as InternVL 1.5 to train this smaller model. Additionally, due to the lower training costs of smaller models, we used a context length of 8K during training.
 
 
 
 ## Model Details
+
 - **Model Type:** multimodal large language model (MLLM)
+
 - **Model Stats:**
+
   - Architecture: [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px) + MLP + [InternLM2-Chat-1.8B](https://huggingface.co/internlm/internlm2-chat-1_8b)
   - Image size: dynamic resolution, max to 40 tiles of 448 x 448 (4K resolution).
   - Params: 2.2B
 
 - **Training Strategy:**
+
   - Learnable component in the pretraining stage: ViT + MLP
   - Learnable component in the finetuning stage: ViT + MLP + LLM
-  - For more details on training hyperparameters, take a look at our code: [pretrain]() | [finetune]()
-
+  - For more details on training hyperparameters, take a look at our code: [pretrain](<>) | [finetune](<>)
+
 ## Released Models
 
-| InternVL-Chat-V1.2-Plus(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus)) | InternViT-6B-448px-V1-2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) |
+| Model | Vision Foundation Model | Release Date | Note |
+| :---: | :---: | :---: | :--- |
+| InternVL-Chat-V1.5(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)) | InternViT-6B-448px-V1-5(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5)) | 2024.04.18 | support 4K image; super strong OCR; Approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc. (🔥new) |
+| InternVL-Chat-V1.2-Plus(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus)) | InternViT-6B-448px-V1-2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.21 | more SFT data and stronger |
+| InternVL-Chat-V1.2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2)) | InternViT-6B-448px-V1-2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.11 | scaling up LLM to 34B |
+| InternVL-Chat-V1.1(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)) | InternViT-6B-448px-V1-0(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0)) | 2024.01.24 | support Chinese and stronger OCR |
 
 ## Performance
 
@@ -59,7 +61,7 @@ As shown in the figure below, we adopted the same model architecture as InternVL
 
 We provide an example code to run Mini-InternVL-Chat-2B-V1.5 using `transformers`.
 
-You can also use our [online demo](https://internvl.opengvlab.com/)
+You can also use our [online demo](https://internvl.opengvlab.com/) for a quick experience of this model.
 
 > Please use transformers==4.37.2 to ensure the model works normally.
 
@@ -150,7 +152,6 @@ def load_image(image_file, input_size=448, max_num=6):
     pixel_values = torch.stack(pixel_values)
     return pixel_values
 
-
 path = "OpenGVLab/Mini-InternVL-Chat-2B-V1-5"
 model = AutoModel.from_pretrained(
     path,
@@ -222,12 +223,18 @@ If you find this project useful in your research, please consider citing:
   journal={arXiv preprint arXiv:2312.14238},
   year={2023}
 }
+@article{chen2024far,
+  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
+  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
+  journal={arXiv preprint arXiv:2404.16821},
+  year={2024}
+}
 ```
 
 ## License
 
-This project is released under the MIT license.
+This project is released under the MIT license.
 
 ## Acknowledgement
 
-InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
+InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
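The README's quickstart reduces to loading the checkpoint with `trust_remote_code` and feeding it 448x448 pixel tiles. Below is a minimal sketch of that flow; the `model.chat` entry point, the generation settings, and the single-crop preprocessing (instead of the README's multi-tile `load_image` helper, which this diff only shows in part) are assumptions noted in the comments.

```python
# Minimal usage sketch for Mini-InternVL-Chat-2B-V1-5, assuming transformers==4.37.2
# as the README requests. model.chat is the chat entry point exposed by the repo's
# custom InternVLChatModel code (assumption based on the model card).
import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/Mini-InternVL-Chat-2B-V1-5"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,      # matches the torch_dtype recorded in config.json
    low_cpu_mem_usage=True,
    trust_remote_code=True,          # needed so auto_map resolves the repo's custom classes
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# Single 448x448 crop instead of the README's tiling load_image helper;
# the ImageNet mean/std below are assumed to match that helper's constants.
transform = T.Compose([
    T.Resize((448, 448), interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
pixel_values = transform(Image.open('example.jpg').convert('RGB')).unsqueeze(0)
pixel_values = pixel_values.to(torch.bfloat16).cuda()

generation_config = dict(num_beams=1, max_new_tokens=512, do_sample=False)
response = model.chat(tokenizer, pixel_values, 'Please describe the image.', generation_config)
print(response)
```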
    	
config.json  CHANGED

@@ -6,13 +6,14 @@
   ],
   "auto_map": {
     "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
-    "AutoModel": "modeling_internvl_chat.InternVLChatModel"
+    "AutoModel": "modeling_internvl_chat.InternVLChatModel",
+    "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel"
   },
   "downsample_ratio": 0.5,
   "dynamic_image_size": true,
   "force_image_size": 448,
   "llm_config": {
-    "_name_or_path": "
+    "_name_or_path": "pretrained/internlm2-chat-1_8b",
     "add_cross_attention": false,
     "architectures": [
       "InternLM2ForCausalLM"
@@ -113,12 +114,16 @@
   "use_llm_lora": 0,
   "use_thumbnail": true,
   "vision_config": {
-    "_name_or_path": "",
+    "_name_or_path": "OpenGVLab/InternViT-300M-448px",
     "add_cross_attention": false,
     "architectures": [
       "InternVisionModel"
     ],
     "attention_dropout": 0.0,
+    "auto_map": {
+      "AutoConfig": "configuration_intern_vit.InternVisionConfig",
+      "AutoModel": "modeling_intern_vit.InternVisionModel"
+    },
     "bad_words_ids": null,
     "begin_suppress_tokens": null,
     "bos_token_id": null,
@@ -189,11 +194,11 @@
     "tokenizer_class": null,
     "top_k": 50,
     "top_p": 1.0,
-    "torch_dtype": "
+    "torch_dtype": "bfloat16",
     "torchscript": false,
     "transformers_version": "4.36.2",
     "typical_p": 1.0,
-    "use_bfloat16": 
+    "use_bfloat16": true,
     "use_flash_attn": true
   }
 }
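The point of the enlarged `auto_map` is that both `AutoModel` and the newly added `AutoModelForCausalLM` now resolve to the custom `InternVLChatModel` shipped in this repo, and the nested vision `auto_map` does the same for `InternVisionModel`. A small sketch of how that resolution looks from the user side, assuming `trust_remote_code=True`:

```python
# Sketch of what the added auto_map entries enable: with trust_remote_code=True,
# transformers instantiates the repo's own classes instead of a built-in architecture.
import torch
from transformers import AutoConfig, AutoModel

path = "OpenGVLab/Mini-InternVL-Chat-2B-V1-5"

config = AutoConfig.from_pretrained(path, trust_remote_code=True)
print(type(config).__name__)   # InternVLChatConfig, per "AutoConfig" in auto_map

# AutoModel (and, after this commit, AutoModelForCausalLM) both map to
# modeling_internvl_chat.InternVLChatModel.
model = AutoModel.from_pretrained(path, torch_dtype=torch.bfloat16, trust_remote_code=True)
print(type(model).__name__)    # InternVLChatModel
```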
    	
conversation.py  CHANGED

@@ -1258,4 +1258,3 @@ register_conv_template(
         sep2='</s>',
     )
 )
-
    	
modeling_intern_vit.py  CHANGED

@@ -26,9 +26,9 @@ try:
     except:  # v2
         from flash_attn.flash_attn_interface import \
             flash_attn_varlen_qkvpacked_func as flash_attn_unpadded_qkvpacked_func
-
+
     from flash_attn.bert_padding import pad_input, unpad_input
-
+
     has_flash_attn = True
 except:
     print('FlashAttention is not installed.')
@@ -47,12 +47,12 @@ class FlashAttention(nn.Module):
         attention_dropout: The dropout rate to apply to the attention
                            (default: 0.0)
     """
-
+
     def __init__(self, softmax_scale=None, attention_dropout=0.0, device=None, dtype=None):
         super().__init__()
         self.softmax_scale = softmax_scale
         self.dropout_p = attention_dropout
-
+
     def forward(self, qkv, key_padding_mask=None, causal=False, cu_seqlens=None,
                 max_s=None, need_weights=False):
         """Implements the multihead softmax attention.
@@ -65,7 +65,7 @@ class FlashAttention(nn.Module):
         assert not need_weights
         assert qkv.dtype in [torch.float16, torch.bfloat16]
         assert qkv.is_cuda
-
+
         if cu_seqlens is None:
             batch_size = qkv.shape[0]
             seqlen = qkv.shape[1]
@@ -97,7 +97,7 @@ class FlashAttention(nn.Module):
                 qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
                 softmax_scale=self.softmax_scale, causal=causal
             )
-
+
         return output, None
 
 
@@ -160,7 +160,7 @@ class InternVisionEmbeddings(nn.Module):
         target_dtype = pos_embed.dtype
         pos_embed = pos_embed.float().reshape(
             1, self.image_size // self.patch_size, self.image_size // self.patch_size, -1).permute(0, 3, 1, 2)
-        pos_embed = F.interpolate(pos_embed, size=(H, W), mode='bicubic', align_corners=False)
+        pos_embed = F.interpolate(pos_embed, size=(H, W), mode='bicubic', align_corners=False). \
             reshape(1, -1, H * W).permute(0, 2, 1).to(target_dtype)
         return pos_embed
 
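The last hunk repairs the line continuation in the position-embedding resize, so the bicubically interpolated grid is flattened back to `(1, H*W, C)` and cast to its original dtype. Below is a standalone sketch of that shape round trip with toy dimensions; the grid size and hidden width here are illustrative, not the model's real values.

```python
# Toy reproduction of the reshape -> interpolate -> reshape round trip performed
# by the patched line (image_size=448, patch_size=14 -> 32x32 grid, hidden dim 64).
import torch
import torch.nn.functional as F

image_size, patch_size, dim = 448, 14, 64
grid = image_size // patch_size                      # 32
pos_embed = torch.randn(1, grid * grid, dim, dtype=torch.bfloat16)

H, W = 16, 48                                        # target grid for a non-square input
target_dtype = pos_embed.dtype
pos_embed = pos_embed.float().reshape(1, grid, grid, -1).permute(0, 3, 1, 2)   # (1, C, 32, 32)
pos_embed = F.interpolate(pos_embed, size=(H, W), mode='bicubic', align_corners=False). \
    reshape(1, -1, H * W).permute(0, 2, 1).to(target_dtype)                    # (1, H*W, C)
print(pos_embed.shape)  # torch.Size([1, 768, 64])
```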
    	
modeling_internlm2.py  CHANGED

@@ -48,16 +48,13 @@ _CONFIG_FOR_DOC = 'InternLM2Config'
 
 flash_attn_func, flash_attn_varlen_func = None, None
 pad_input, index_first_axis, unpad_input = None, None, None
-
 try:
     from flash_attn import flash_attn_func as _flash_attn_func
-    from flash_attn import
-
-    from flash_attn.bert_padding import \
-        index_first_axis as _index_first_axis
+    from flash_attn import flash_attn_varlen_func as _flash_attn_varlen_func
+    from flash_attn.bert_padding import index_first_axis as _index_first_axis
     from flash_attn.bert_padding import pad_input as _pad_input
     from flash_attn.bert_padding import unpad_input as _unpad_input
-
+
     flash_attn_func, flash_attn_varlen_func = _flash_attn_func, _flash_attn_varlen_func
     pad_input, index_first_axis, unpad_input = _pad_input, _index_first_axis, _unpad_input
     has_flash_attn = True
@@ -164,7 +161,7 @@ class InternLM2RotaryEmbedding(nn.Module):
 
     def _set_cos_sin_cache(self, seq_len, device, dtype):
         self.max_seq_len_cached = seq_len
-        t = torch.arange(self.max_seq_len_cached, device=device
+        t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
 
         freqs = torch.einsum('i,j->ij', t, self.inv_freq)
         # Different from paper, but it uses a different permutation in order to obtain the same calculation
@@ -193,7 +190,7 @@ class InternLM2LinearScalingRotaryEmbedding(InternLM2RotaryEmbedding):
 
     def _set_cos_sin_cache(self, seq_len, device, dtype):
         self.max_seq_len_cached = seq_len
-        t = torch.arange(self.max_seq_len_cached, device=device
+        t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
         t = t / self.scaling_factor
 
         freqs = torch.einsum('i,j->ij', t, self.inv_freq)
@@ -223,7 +220,7 @@ class InternLM2DynamicNTKScalingRotaryEmbedding(InternLM2RotaryEmbedding):
             inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
             self.register_buffer('inv_freq', inv_freq, persistent=False)
 
-        t = torch.arange(self.max_seq_len_cached, device=device
+        t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
 
         freqs = torch.einsum('i,j->ij', t, self.inv_freq)
         # Different from paper, but it uses a different permutation in order to obtain the same calculation
@@ -810,6 +807,9 @@ class InternLM2Model(InternLM2PreTrainedModel):
         self.padding_idx = config.pad_token_id
         self.vocab_size = config.vocab_size
         self.config = config
+        if not has_flash_attn:
+            self.config.attn_implementation = 'eager'
+            print('Warning: Flash attention is not available, using eager attention instead.')
 
         self.tok_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
 
@@ -870,7 +870,7 @@ class InternLM2Model(InternLM2PreTrainedModel):
 
         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
 
-        if self.config.attn_implementation == 'flash_attention_2'
+        if self.config.attn_implementation == 'flash_attention_2':
            _import_flash_attn()
 
         # retrieve input_ids and inputs_embeds
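Two things change here: the three rotary `_set_cos_sin_cache` variants now build `t` on the target device and cast it to `inv_freq`'s dtype before the einsum, and the model falls back to eager attention (with a warning) when flash-attn is not installed. Below is a toy standalone sketch of the cache computation those edits touch; the cat/cos/sin tail is the standard LLaMA-style cache that the surrounding code builds, and the sizes are illustrative.

```python
# Toy rotary cos/sin cache mirroring the edited _set_cos_sin_cache lines.
import torch

dim, base, seq_len = 8, 10000, 6
inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))         # (dim/2,)
t = torch.arange(seq_len, device=inv_freq.device).to(dtype=inv_freq.dtype)  # the patched cast
freqs = torch.einsum('i,j->ij', t, inv_freq)                                # (seq_len, dim/2)
emb = torch.cat((freqs, freqs), dim=-1)                                     # (seq_len, dim)
cos_cached, sin_cached = emb.cos(), emb.sin()
print(cos_cached.shape, sin_cached.shape)  # torch.Size([6, 8]) torch.Size([6, 8])
```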
    	
modeling_internvl_chat.py  CHANGED

@@ -233,7 +233,7 @@ class InternVLChatModel(PreTrainedModel):
                          return_history=False, IMG_START_TOKEN='<img>', IMG_END_TOKEN='</img>',
                          IMG_CONTEXT_TOKEN='<IMG_CONTEXT>'):
         if history is not None or return_history:
-            print(
+            print('Now multi-turn chat is not supported in batch_chat.')
             raise NotImplementedError
         img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
         self.img_context_token_id = img_context_token_id
@@ -241,9 +241,9 @@ class InternVLChatModel(PreTrainedModel):
             eos_token_id = tokenizer.convert_tokens_to_ids('<|im_end|>')  # 92542, InternLM2
         else:
             eos_token_id = tokenizer.eos_token_id
-
+
         from .conversation import get_conv_template
-
+
         queries = []
         image_bs = pixel_values.shape[0]
         # print(f'dynamic ViT batch size: {image_bs}, image_counts: {image_counts}')
@@ -260,7 +260,7 @@ class InternVLChatModel(PreTrainedModel):
         input_ids = model_inputs['input_ids'].cuda()
         attention_mask = model_inputs['attention_mask'].cuda()
         generation_config['eos_token_id'] = eos_token_id
-
+
         generation_output = self.generate(
             pixel_values=pixel_values,
             input_ids=input_ids,
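`batch_chat` now prints an explicit message before raising `NotImplementedError` when a history is requested, and the surrounding whitespace is tidied. For context, here is a sketch of the generation setup those lines sit in, with the InternLM2 stop token wired into the generation config; the repo and tokenizer names are taken from this model card, and only the tokenizer side is exercised.

```python
# Sketch of the eos handling around the edited lines: InternLM2-based templates
# stop on '<|im_end|>', whose id is injected into the generation config before
# self.generate is called.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("OpenGVLab/Mini-InternVL-Chat-2B-V1-5",
                                          trust_remote_code=True)
eos_token_id = tokenizer.convert_tokens_to_ids('<|im_end|>')  # 92542 for InternLM2, per the code comment
generation_config = dict(num_beams=1, max_new_tokens=512, do_sample=False)
generation_config['eos_token_id'] = eos_token_id              # mirrors the line in batch_chat
print(generation_config)
# generation_output = model.generate(pixel_values=pixel_values, input_ids=input_ids,
#                                    attention_mask=attention_mask, **generation_config)
```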
    	
preprocessor_config.json  CHANGED

@@ -16,4 +16,4 @@
   ],
   "resample": 3,
   "size": 448
-}
+}
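For reference, `"resample": 3` in this config is the integer code PIL uses for bicubic resampling and `"size": 448` is the target resolution. A tiny standalone illustration of the equivalent resize (the input filename is a placeholder):

```python
# "resample": 3 corresponds to PIL's bicubic filter; "size": 448 is the target side length.
from PIL import Image

BICUBIC = 3  # same integer code as Image.Resampling.BICUBIC
img = Image.open('example.jpg').convert('RGB')   # placeholder input file
img = img.resize((448, 448), resample=BICUBIC)
print(img.size)  # (448, 448)
```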
