yonigozlan (HF Staff) committed
Commit f02d59b · verified · 1 parent: d6aee8c

Update README.md

Files changed (1):
  1. README.md +10 -10
README.md CHANGED
@@ -5,7 +5,7 @@ license_link: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE
  pipeline_tag: image-text-to-text
  library_name: transformers
  base_model:
- - OpenGVLab/InternVL3-1B-Instruct
+ - OpenGVLab/InternVL3-8B-Instruct
  base_model_relation: finetune
  datasets:
  - OpenGVLab/MMPR-v1.2
@@ -15,7 +15,7 @@ tags:
  - internvl
  ---

- # InternVL3-1B Transformers 🤗 Implementation
+ # InternVL3-8B Transformers 🤗 Implementation

  [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](https://huggingface.co/papers/2504.10479)

@@ -27,7 +27,7 @@ tags:


  > [!IMPORTANT]
- > This repository contains the Hugging Face 🤗 Transformers implementation for the [OpenGVLab/InternVL3-1B](https://huggingface.co/OpenGVLab/InternVL3-1B) model.
+ > This repository contains the Hugging Face 🤗 Transformers implementation for the [OpenGVLab/InternVL3-8B](https://huggingface.co/OpenGVLab/InternVL3-8B) model.
  > It is intended to be functionally equivalent to the original OpenGVLab release.
  > As a native Transformers model, it supports core library features such as various attention implementations (eager, including SDPA, and FA2) and enables efficient batched inference with interleaved image, video, and text inputs.

@@ -39,7 +39,7 @@ Additionally, we compare InternVL3 with Qwen2.5 Chat models, whose correspondin

  ![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/overall.png)

- You can find more info on the InternVL3 family in the original checkpoint [OpenGVLab/InternVL3-1B](https://huggingface.co/OpenGVLab/InternVL3-1B)
+ You can find more info on the InternVL3 family in the original checkpoint [OpenGVLab/InternVL3-8B](https://huggingface.co/OpenGVLab/InternVL3-8B)

  ## Usage example

@@ -63,7 +63,7 @@ Here is how you can use the `image-text-to-text` pipeline to perform inference w
  ... },
  ... ]

- >>> pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-1B-hf")
+ >>> pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-8B-hf")
  >>> outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
  >>> outputs[0]["generated_text"]
  'The image showcases a vibrant scene of nature, featuring several flowers and a bee. \n\n1. **Foreground Flowers**: \n - The primary focus is on a large, pink cosmos flower with a prominent yellow center. The petals are soft and slightly r'
@@ -80,7 +80,7 @@ This example demonstrates how to perform inference on a single image with the In
  >>> import torch

  >>> torch_device = "cuda"
- >>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
+ >>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
  >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
  >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

@@ -112,7 +112,7 @@ This example shows how to generate text using the InternVL model without providi
  >>> import torch

  >>> torch_device = "cuda"
- >>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
+ >>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
  >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
  >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

@@ -142,7 +142,7 @@ InternVL models also support batched image and text inputs.
  >>> import torch

  >>> torch_device = "cuda"
- >>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
+ >>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
  >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
  >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

@@ -186,7 +186,7 @@ This implementation of the InternVL models supports batched text-images inputs w
  >>> import torch

  >>> torch_device = "cuda"
- >>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
+ >>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
  >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
  >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

@@ -268,7 +268,7 @@ This example showcases how to handle a batch of chat conversations with interlea
  >>> import torch

  >>> torch_device = "cuda"
- >>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
+ >>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
  >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
  >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
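
The diff above only shows the lines that changed, so the loading and generation code appears in fragments. As a reference, here is a minimal, self-contained sketch of how the renamed `OpenGVLab/InternVL3-8B-hf` checkpoint would be loaded and run for single-image inference, following the pattern visible in the README excerpts. The image URL is a placeholder and the `attn_implementation="sdpa"` choice is an illustrative assumption (the README lists eager, SDPA, and FA2 as supported); neither is part of this commit.

```python
# Minimal sketch, assuming a CUDA device and the checkpoint name introduced by this commit.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_checkpoint = "OpenGVLab/InternVL3-8B-hf"  # renamed checkpoint from this commit
processor = AutoProcessor.from_pretrained(model_checkpoint)
model = AutoModelForImageTextToText.from_pretrained(
    model_checkpoint,
    device_map="cuda",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",  # assumption: could also be "eager" or "flash_attention_2"
)

# Chat-style input with one interleaved image, as in the README's usage examples.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/image.jpg"},  # placeholder URL
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated = model.generate(**inputs, max_new_tokens=50)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.decode(generated[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Every `model_checkpoint` fragment in the diff is updated the same way, so the other README examples (text-only, batched, and interleaved inputs) differ from this sketch only in how `messages` is constructed.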