czczup commited on
Commit
05b2052
1 Parent(s): af349a9

Upload folder using huggingface_hub

Browse files
Files changed (2) hide show
  1. README.md +2 -2
  2. conversation.py +1 -1
README.md CHANGED
@@ -15,7 +15,7 @@ We released [🤗 InternVL-Chat-V1-1](https://huggingface.co/OpenGVLab/InternVL-
15
  As shown in the figure below, we connected our InternViT-6B to LLaMA2-13B through a simple MLP projector. Note that the LLaMA2-13B used here is not the original model but an internal chat version obtained by incrementally pre-training and fine-tuning the LLaMA2-13B base model for Chinese language tasks. Overall, our model has a total of 19 billion parameters.
16
 
17
  <p align="center">
18
- <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/HD29tU-g0An9FpQn1yK8X.png" style="width: 75%;">
19
  </p>
20
 
21
  In this version, we explored increasing the resolution to 448 × 448, enhancing OCR capabilities, and improving support for Chinese conversations. Since the 448 × 448 input image generates 1024 visual tokens after passing through the ViT, leading to a significant computational burden, we use a pixel shuffle operation to reduce the 1024 tokens to 256 tokens.
@@ -122,7 +122,7 @@ The reason for writing the code this way is to avoid errors that occur during mu
122
  ```python
123
  import math
124
  import torch
125
- from transformers import AutoTokenizer, AutoModel, CLIPImageProcessor
126
 
127
  def split_model(model_name):
128
  device_map = {}
 
15
  As shown in the figure below, we connected our InternViT-6B to LLaMA2-13B through a simple MLP projector. Note that the LLaMA2-13B used here is not the original model but an internal chat version obtained by incrementally pre-training and fine-tuning the LLaMA2-13B base model for Chinese language tasks. Overall, our model has a total of 19 billion parameters.
16
 
17
  <p align="center">
18
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/HD29tU-g0An9FpQn1yK8X.png" style="width: 75%;">
19
  </p>
20
 
21
  In this version, we explored increasing the resolution to 448 × 448, enhancing OCR capabilities, and improving support for Chinese conversations. Since the 448 × 448 input image generates 1024 visual tokens after passing through the ViT, leading to a significant computational burden, we use a pixel shuffle operation to reduce the 1024 tokens to 256 tokens.
 
122
  ```python
123
  import math
124
  import torch
125
+ from transformers import AutoTokenizer, AutoModel
126
 
127
  def split_model(model_name):
128
  device_map = {}
conversation.py CHANGED
@@ -2,7 +2,7 @@
2
  Conversation prompt templates.
3
 
4
  We kindly request that you import fastchat instead of copying this file if you wish to use it.
5
- If you have any changes in mind, please contribute back so the community can benefit collectively and continue to maintain these valuable templates.
6
  """
7
 
8
  import dataclasses
 
2
  Conversation prompt templates.
3
 
4
  We kindly request that you import fastchat instead of copying this file if you wish to use it.
5
+ If you have changes in mind, please contribute back so the community can benefit collectively and continue to maintain these valuable templates.
6
  """
7
 
8
  import dataclasses