Upload folder using huggingface_hub
Files changed:
- README.md +2 -2
- conversation.py +1 -1
README.md
CHANGED
@@ -15,7 +15,7 @@ We released [🤗 InternVL-Chat-V1-1](https://huggingface.co/OpenGVLab/InternVL-
 As shown in the figure below, we connected our InternViT-6B to LLaMA2-13B through a simple MLP projector. Note that the LLaMA2-13B used here is not the original model but an internal chat version obtained by incrementally pre-training and fine-tuning the LLaMA2-13B base model for Chinese language tasks. Overall, our model has a total of 19 billion parameters.
 
 <p align="center">
-
+<img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/HD29tU-g0An9FpQn1yK8X.png" style="width: 75%;">
 </p>
 
 In this version, we explored increasing the resolution to 448 × 448, enhancing OCR capabilities, and improving support for Chinese conversations. Since the 448 × 448 input image generates 1024 visual tokens after passing through the ViT, leading to a significant computational burden, we use a pixel shuffle operation to reduce the 1024 tokens to 256 tokens.
@@ -122,7 +122,7 @@ The reason for writing the code this way is to avoid errors that occur during mu
 ```python
 import math
 import torch
-from transformers import AutoTokenizer, AutoModel
+from transformers import AutoTokenizer, AutoModel
 
 def split_model(model_name):
     device_map = {}
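The README context in the first hunk states that a 448 × 448 image yields 1024 visual tokens, which a pixel shuffle operation reduces to 256. The token arithmetic can be illustrated with a minimal pure-Python sketch. It assumes a ViT patch size of 14 (not stated in the diff), and the helper name `pixel_shuffle_downsample` is hypothetical. It is not the repository's implementation, and the channel ordering inside each merged token may differ from the real model.

```python
# Sketch only: merges each 2 x 2 block of neighbouring tokens into one token.
# PATCH_SIZE = 14 is an assumption; the diff only gives 448 x 448 -> 1024 tokens.
PATCH_SIZE = 14


def pixel_shuffle_downsample(tokens, grid, factor=2):
    """Merge factor x factor token blocks (space-to-depth) on a square grid.

    tokens: list of grid * grid feature vectors (lists), in row-major order.
    Returns (grid // factor) ** 2 vectors, each factor ** 2 times longer.
    """
    if grid % factor != 0 or len(tokens) != grid * grid:
        raise ValueError("token count must match a grid divisible by factor")
    out_grid = grid // factor
    merged = []
    for row in range(out_grid):
        for col in range(out_grid):
            vec = []
            # Concatenate the features of the block's tokens, row by row.
            for dr in range(factor):
                for dc in range(factor):
                    idx = (row * factor + dr) * grid + (col * factor + dc)
                    vec.extend(tokens[idx])
            merged.append(vec)
    return merged


grid = 448 // PATCH_SIZE                       # 32 patches per side
tokens = [[float(i)] for i in range(grid * grid)]  # 1024 dummy 1-dim tokens
merged = pixel_shuffle_downsample(tokens, grid)
print(len(tokens), len(merged), len(merged[0]))
```

The token count drops by a factor of four while each remaining token carries four times as many features, so no information is discarded before the MLP projector.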
conversation.py
CHANGED
@@ -2,7 +2,7 @@
 Conversation prompt templates.
 
 We kindly request that you import fastchat instead of copying this file if you wish to use it.
-If you have
+If you have changes in mind, please contribute back so the community can benefit collectively and continue to maintain these valuable templates.
 """
 
 import dataclasses