OpenGVLab/InternVL-Chat-V1-1


czczup committed on Jul 25, 2024

Commit 05b2052 · verified · 1 Parent(s): af349a9

Upload folder using huggingface_hub

Files changed (2)

README.md +2 -2
conversation.py +1 -1

README.md CHANGED

@@ -15,7 +15,7 @@ We released [🤗 InternVL-Chat-V1-1](https://huggingface.co/OpenGVLab/InternVL-
 As shown in the figure below, we connected our InternViT-6B to LLaMA2-13B through a simple MLP projector. Note that the LLaMA2-13B used here is not the original model but an internal chat version obtained by incrementally pre-training and fine-tuning the LLaMA2-13B base model for Chinese language tasks. Overall, our model has a total of 19 billion parameters.
 <p align="center">
-    <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/HD29tU-g0An9FpQn1yK8X.png" style="width: 75%;">
 </p>
 In this version, we explored increasing the resolution to 448 × 448, enhancing OCR capabilities, and improving support for Chinese conversations. Since the 448 × 448 input image generates 1024 visual tokens after passing through the ViT, leading to a significant computational burden, we use a pixel shuffle operation to reduce the 1024 tokens to 256 tokens.
@@ -122,7 +122,7 @@ The reason for writing the code this way is to avoid errors that occur during mu
 ```python
 import math
 import torch
-from transformers import AutoTokenizer, AutoModel, CLIPImageProcessor
 def split_model(model_name):
     device_map = {}

 As shown in the figure below, we connected our InternViT-6B to LLaMA2-13B through a simple MLP projector. Note that the LLaMA2-13B used here is not the original model but an internal chat version obtained by incrementally pre-training and fine-tuning the LLaMA2-13B base model for Chinese language tasks. Overall, our model has a total of 19 billion parameters.
 <p align="center">
+  <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/HD29tU-g0An9FpQn1yK8X.png" style="width: 75%;">
 </p>
 In this version, we explored increasing the resolution to 448 × 448, enhancing OCR capabilities, and improving support for Chinese conversations. Since the 448 × 448 input image generates 1024 visual tokens after passing through the ViT, leading to a significant computational burden, we use a pixel shuffle operation to reduce the 1024 tokens to 256 tokens.
 ```python
 import math
 import torch
+from transformers import AutoTokenizer, AutoModel
 def split_model(model_name):
     device_map = {}
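
The token arithmetic in the README context lines of this diff can be checked with a short sketch. It assumes a ViT patch size of 14, which gives the 32 × 32 = 1024 patch grid the README states for a 448 × 448 input, and it treats pixel shuffle as a space-to-depth step that merges each 2 × 2 block of tokens into one. The function names and the channel width of 3200 are illustrative assumptions, not taken from the repository code.

```python
import math
from typing import Tuple


def visual_token_count(image_size: int, patch_size: int = 14) -> int:
    """Number of patch tokens for a square image (patch_size=14 is an assumption)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def pixel_shuffle_shape(num_tokens: int, channels: int, factor: int = 2) -> Tuple[int, int]:
    """Shape after merging each factor x factor block of tokens into one token.

    The token count shrinks by factor**2 and the channel width grows by the
    same amount, so no information is discarded by the reshape itself.
    """
    side = math.isqrt(num_tokens)
    if side * side != num_tokens:
        raise ValueError("num_tokens must form a square grid")
    if side % factor != 0:
        raise ValueError("grid side must be divisible by factor")
    new_side = side // factor
    return new_side * new_side, channels * factor * factor


tokens = visual_token_count(448)                      # 32 x 32 patch grid
reduced, width = pixel_shuffle_shape(tokens, 3200)    # 3200 is a hypothetical width
print(tokens, reduced, width)
```

Under these assumptions the sketch reproduces the README's figures: 1024 tokens before the shuffle and 256 after, with each remaining token carrying four times as many channels.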

conversation.py CHANGED

@@ -2,7 +2,7 @@
 Conversation prompt templates.
 We kindly request that you import fastchat instead of copying this file if you wish to use it.
-If you have any changes in mind, please contribute back so the community can benefit collectively and continue to maintain these valuable templates.
 """
 import dataclasses

 Conversation prompt templates.
 We kindly request that you import fastchat instead of copying this file if you wish to use it.
+If you have changes in mind, please contribute back so the community can benefit collectively and continue to maintain these valuable templates.
 """
 import dataclasses