Update README.md
README.md CHANGED
@@ -5,7 +5,7 @@ license_link: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE
 pipeline_tag: image-text-to-text
 library_name: transformers
 base_model:
-- OpenGVLab/InternVL3-
+- OpenGVLab/InternVL3-8B-Instruct
 base_model_relation: finetune
 datasets:
 - OpenGVLab/MMPR-v1.2
@@ -15,7 +15,7 @@ tags:
 - internvl
 ---

-# InternVL3-
+# InternVL3-8B Transformers 🤗 Implementation

 [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](https://huggingface.co/papers/2504.10479)

@@ -27,7 +27,7 @@ tags:


 > [!IMPORTANT]
-> This repository contains the Hugging Face 🤗 Transformers implementation for the [OpenGVLab/InternVL3-
+> This repository contains the Hugging Face 🤗 Transformers implementation for the [OpenGVLab/InternVL3-8B](https://huggingface.co/OpenGVLab/InternVL3-8B) model.
 > It is intended to be functionally equivalent to the original OpenGVLab release.
 > As a native Transformers model, it supports core library features such as various attention implementations (eager, SDPA, and FA2) and enables efficient batched inference with interleaved image, video, and text inputs.

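The note above mentions selectable attention implementations. Below is a minimal sketch of how one of them could be requested when loading this checkpoint; `attn_implementation` is a standard `from_pretrained` option, and `flash_attention_2` additionally requires the `flash-attn` package to be installed, so treat the chosen value as illustrative rather than required.

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
processor = AutoProcessor.from_pretrained(model_checkpoint)
# "eager" and "sdpa" need no extra packages; "flash_attention_2" depends on flash-attn.
model = AutoModelForImageTextToText.from_pretrained(
    model_checkpoint,
    device_map="cuda",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",  # or "eager" / "flash_attention_2"
)
```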
@@ -39,7 +39,7 @@ Additionally, we compare InternVL3 with Qwen2.5 Chat models, whose correspondin

 

-You can find more info on the InternVL3 family in the original checkpoint [OpenGVLab/InternVL3-
+You can find more info on the InternVL3 family in the original checkpoint [OpenGVLab/InternVL3-8B](https://huggingface.co/OpenGVLab/InternVL3-8B)

 ## Usage example

@@ -63,7 +63,7 @@ Here is how you can use the `image-text-to-text` pipeline to perform inference w
 ... },
 ... ]

->>> pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-
+>>> pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-8B-hf")
 >>> outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
 >>> outputs[0]["generated_text"]
 'The image showcases a vibrant scene of nature, featuring several flowers and a bee. \n\n1. **Foreground Flowers**: \n - The primary focus is on a large, pink cosmos flower with a prominent yellow center. The petals are soft and slightly r'
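Only the tail of the `messages` list is visible in the hunk above, so here is a rough sketch of a complete pipeline call; the image URL and prompt are placeholders rather than the ones used in the original example.

```python
from transformers import pipeline

# A chat-style request: each content entry is either an image (by URL) or text.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/flowers.jpg"},  # placeholder URL
            {"type": "text", "text": "Describe this image."},
        ],
    },
]

pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-8B-hf")
outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
print(outputs[0]["generated_text"])
```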
@@ -80,7 +80,7 @@ This example demonstrates how to perform inference on a single image with the In
 >>> import torch

 >>> torch_device = "cuda"
->>> model_checkpoint = "OpenGVLab/InternVL3-
+>>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
 >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
 >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

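The hunk stops right after the processor and model are loaded. Continuing from those lines, a sketch of the rest of the single-image flow could look like the following; the image URL and question are placeholders, not the ones from the original README.

```python
# Continues from the loading lines above (torch, processor, model).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/bee.jpg"},  # placeholder URL
            {"type": "text", "text": "What do you see in this image?"},
        ],
    },
]

# The processor's chat template tokenizes the prompt and prepares pixel values.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

generate_ids = model.generate(**inputs, max_new_tokens=50)
# Strip the prompt tokens before decoding so only the answer is returned.
answer = processor.decode(generate_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)
```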
@@ -112,7 +112,7 @@ This example shows how to generate text using the InternVL model without providi
 >>> import torch

 >>> torch_device = "cuda"
->>> model_checkpoint = "OpenGVLab/InternVL3-
+>>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
 >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
 >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

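For the text-only case this hunk belongs to, only the message content changes. A brief sketch, again continuing from the loading lines above and using a placeholder prompt:

```python
# No image entry in the content list; the rest of the flow is unchanged.
messages = [
    {"role": "user", "content": [{"type": "text", "text": "Write a haiku about spring."}]},
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

generate_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(generate_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```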
@@ -142,7 +142,7 @@ InternVL models also support batched image and text inputs.
 >>> import torch

 >>> torch_device = "cuda"
->>> model_checkpoint = "OpenGVLab/InternVL3-
+>>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
 >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
 >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

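The body of the batched example lies outside this hunk. As a hedged sketch of what batched inputs could look like (placeholder URLs and prompts; `padding=True` lets samples of different lengths be stacked):

```python
# Continues from the loading lines above: one conversation per batch element.
messages = [
    [  # first sample: image + question
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://example.com/cat.jpg"},  # placeholder URL
                {"type": "text", "text": "Describe this image."},
            ],
        },
    ],
    [  # second sample: text only
        {"role": "user", "content": [{"type": "text", "text": "Who are you?"}]},
    ],
]

inputs = processor.apply_chat_template(
    messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

generate_ids = model.generate(**inputs, max_new_tokens=50)
answers = processor.batch_decode(generate_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```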
@@ -186,7 +186,7 @@ This implementation of the InternVL models supports batched text-images inputs w
 >>> import torch

 >>> torch_device = "cuda"
->>> model_checkpoint = "OpenGVLab/InternVL3-
+>>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
 >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
 >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

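For the "different number of images per sample" case this hunk belongs to, only the `messages` structure differs from the batched sketch above; the URLs and questions below are placeholders.

```python
# First conversation carries two images, the second only one; the chat-template
# and generate calls are the same as in the previous sketch.
messages = [
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://example.com/street_day.jpg"},    # placeholder URL
                {"type": "image", "url": "https://example.com/street_night.jpg"},  # placeholder URL
                {"type": "text", "text": "What differs between these two images?"},
            ],
        },
    ],
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://example.com/bridge.jpg"},  # placeholder URL
                {"type": "text", "text": "Where might this photo have been taken?"},
            ],
        },
    ],
]
```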
@@ -268,7 +268,7 @@ This example showcases how to handle a batch of chat conversations with interlea
 >>> import torch

 >>> torch_device = "cuda"
->>> model_checkpoint = "OpenGVLab/InternVL3-
+>>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
 >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
 >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

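The interleaved image-and-video example referenced by this last hunk is likewise mostly outside the diff. Below is a hedged sketch of such a batch; the media URLs are placeholders, and decoding video frames requires an extra backend such as PyAV to be installed.

```python
# Continues from the loading lines above: one video-based and one image-based conversation.
messages = [
    [
        {
            "role": "user",
            "content": [
                {"type": "video", "url": "https://example.com/clip.mp4"},  # placeholder URL
                {"type": "text", "text": "What is happening in this video?"},
            ],
        },
    ],
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://example.com/painting.jpg"},  # placeholder URL
                {"type": "text", "text": "Describe this painting."},
            ],
        },
    ],
]

inputs = processor.apply_chat_template(
    messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

generate_ids = model.generate(**inputs, max_new_tokens=50)
answers = processor.batch_decode(generate_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```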