yonigozlan (HF Staff) committed · Commit 4f0cf36 · verified · 1 Parent(s): b9ef831

Update README.md

Files changed (1):
  1. README.md +334 -176

README.md CHANGED
@@ -1,199 +1,357 @@
  ---
  library_name: transformers
- tags: []
  ---

- # Model Card for Model ID

- <!-- Provide a quick summary of what the model is/does. -->

- ## Model Details

- ### Model Description

- <!-- Provide a longer summary of what this model is. -->

- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]

- ### Model Sources [optional]

- <!-- Provide the basic links for the model. -->

- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]

- ## Uses

- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

- ### Direct Use

- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

- [More Information Needed]

- ### Downstream Use [optional]

- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

- [More Information Needed]

- ### Out-of-Scope Use

- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

- [More Information Needed]

- ## Bias, Risks, and Limitations

- <!-- This section is meant to convey both technical and sociotechnical limitations. -->

- [More Information Needed]

- ### Recommendations

- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

- ## How to Get Started with the Model

- Use the code below to get started with the model.

- [More Information Needed]

- ## Training Details

- ### Training Data

- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

- [More Information Needed]

- ### Training Procedure

- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

- #### Preprocessing [optional]

- [More Information Needed]

- #### Training Hyperparameters

- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

- #### Speeds, Sizes, Times [optional]

- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

- [More Information Needed]

- ## Evaluation

- <!-- This section describes the evaluation protocols and provides the results. -->

- ### Testing Data, Factors & Metrics

- #### Testing Data

- <!-- This should link to a Dataset Card if possible. -->

- [More Information Needed]

- #### Factors

- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

- [More Information Needed]

- #### Metrics

- <!-- These are the evaluation metrics being used, ideally with a description of why. -->

- [More Information Needed]

- ### Results

- [More Information Needed]

- #### Summary

- ## Model Examination [optional]

- <!-- Relevant interpretability work for the model goes here -->

- [More Information Needed]

- ## Environmental Impact

- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]

- ## Technical Specifications [optional]

- ### Model Architecture and Objective

- [More Information Needed]

- ### Compute Infrastructure

- [More Information Needed]

- #### Hardware

- [More Information Needed]

- #### Software

- [More Information Needed]

- ## Citation [optional]

- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

- **BibTeX:**

- [More Information Needed]

- **APA:**

- [More Information Needed]

- ## Glossary [optional]

- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

- [More Information Needed]

- ## More Information [optional]

- [More Information Needed]

- ## Model Card Authors [optional]

- [More Information Needed]

- ## Model Card Contact

- [More Information Needed]
  ---
+ license: other
+ license_name: qwen
+ license_link: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE
+ pipeline_tag: image-text-to-text
  library_name: transformers
+ base_model:
+ - OpenGVLab/InternVL3-2B-Instruct
+ base_model_relation: finetune
+ datasets:
+ - OpenGVLab/MMPR-v1.2
+ language:
+ - multilingual
+ tags:
+ - internvl
  ---

+ # InternVL3-2B Transformers 🤗 Implementation

+ [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](https://huggingface.co/papers/2504.10479)

+ [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)

+ <div align="center">
+ <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64006c09330a45b03605bba3/zJsd2hqd3EevgXo6fNgC-.png">
+ </div>

+ > [!IMPORTANT]
+ > This repository contains the Hugging Face 🤗 Transformers implementation of the [OpenGVLab/InternVL3-2B](https://huggingface.co/OpenGVLab/InternVL3-2B) model.
+ > It is intended to be functionally equivalent to the original OpenGVLab release.
+ > As a native Transformers model, it supports core library features such as the different attention implementations (eager, SDPA, FlashAttention-2) and enables efficient batched inference with interleaved image, video, and text inputs.
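+
+ For example, you can pick the attention backend with the standard `attn_implementation` argument of `from_pretrained`. The snippet below is a minimal sketch rather than part of the original card: `"sdpa"` works out of the box with recent PyTorch, while `"flash_attention_2"` additionally assumes the optional `flash-attn` package and a supported GPU.
+
+ ```python
+ from transformers import AutoModelForImageTextToText
+ import torch
+
+ # Choose one of "eager", "sdpa", or "flash_attention_2"
+ # ("flash_attention_2" requires the flash-attn package to be installed).
+ model = AutoModelForImageTextToText.from_pretrained(
+     "OpenGVLab/InternVL3-2B-hf",
+     torch_dtype=torch.bfloat16,
+     device_map="cuda",
+     attn_implementation="sdpa",
+ )
+ ```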
 
+ ## Introduction

+ We introduce InternVL3, an advanced multimodal large language model (MLLM) series that demonstrates superior overall performance.
+ Compared to InternVL 2.5, InternVL3 exhibits stronger multimodal perception and reasoning capabilities, while further extending its multimodal capabilities to encompass tool usage, GUI agents, industrial image analysis, 3D vision perception, and more.
+ Additionally, we compare InternVL3 with the Qwen2.5 Chat models, whose corresponding pre-trained base models are employed as the initialization of the language component in InternVL3. Benefiting from Native Multimodal Pre-Training, the InternVL3 series achieves even better overall text performance than the Qwen2.5 series.

+ ![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/overall.png)

+ You can find more information on the InternVL3 family in the original checkpoint [OpenGVLab/InternVL3-2B](https://huggingface.co/OpenGVLab/InternVL3-2B).
 
+ ## Usage example

+ ### Inference with Pipeline

+ Here is how you can use the `image-text-to-text` pipeline to perform inference with the `InternVL3` models in just a few lines of code:

+ ```python
+ >>> from transformers import pipeline
+
+ >>> messages = [
+ ...     {
+ ...         "role": "user",
+ ...         "content": [
+ ...             {
+ ...                 "type": "image",
+ ...                 "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
+ ...             },
+ ...             {"type": "text", "text": "Describe this image."},
+ ...         ],
+ ...     },
+ ... ]
+
+ >>> pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-2B-hf")
+ >>> outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
+ >>> outputs[0]["generated_text"]
+ 'The image showcases a vibrant scene of nature, featuring several flowers and a bee. \n\n1. **Foreground Flowers**: \n - The primary focus is on a large, pink cosmos flower with a prominent yellow center. The petals are soft and slightly r'
+ ```
+
+ ### Inference on a single image

+ This example demonstrates how to perform inference on a single image with the InternVL models using chat templates.

+ > [!NOTE]
+ > The model has been trained with a specific prompt format for chatting. Use `processor.apply_chat_template(my_conversation_dict)` to correctly format your prompts.

+ ```python
+ >>> from transformers import AutoProcessor, AutoModelForImageTextToText
+ >>> import torch
+
+ >>> torch_device = "cuda"
+ >>> model_checkpoint = "OpenGVLab/InternVL3-2B-hf"
+ >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
+ >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
+
+ >>> messages = [
+ ...     {
+ ...         "role": "user",
+ ...         "content": [
+ ...             {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
+ ...             {"type": "text", "text": "Please describe the image explicitly."},
+ ...         ],
+ ...     }
+ ... ]
+
+ >>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
+
+ >>> generate_ids = model.generate(**inputs, max_new_tokens=50)
+ >>> decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
+
+ >>> decoded_output
+ 'The image shows two cats lying on a pink blanket. The cat on the left is a tabby with a mix of brown, black, and white fur, and it appears to be sleeping with its head resting on the blanket. The cat on the'
+ ```
+
+ ### Text-only generation

+ This example shows how to generate text using the InternVL model without providing any image input.

+ ```python
+ >>> from transformers import AutoProcessor, AutoModelForImageTextToText
+ >>> import torch
+
+ >>> torch_device = "cuda"
+ >>> model_checkpoint = "OpenGVLab/InternVL3-2B-hf"
+ >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
+ >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
+
+ >>> messages = [
+ ...     {
+ ...         "role": "user",
+ ...         "content": [
+ ...             {"type": "text", "text": "Write a haiku"},
+ ...         ],
+ ...     }
+ ... ]
+
+ >>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(torch_device, dtype=torch.bfloat16)
+
+ >>> generate_ids = model.generate(**inputs, max_new_tokens=50)
+ >>> decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
+
+ >>> print(decoded_output)
+ "Whispers of dawn,\nSilent whispers of the night,\nNew day's light begins."
+ ```
+
+ ### Batched image and text inputs

+ InternVL models also support batched image and text inputs.

+ ```python
+ >>> from transformers import AutoProcessor, AutoModelForImageTextToText
+ >>> import torch
+
+ >>> torch_device = "cuda"
+ >>> model_checkpoint = "OpenGVLab/InternVL3-2B-hf"
+ >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
+ >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
+
+ >>> messages = [
+ ...     [
+ ...         {
+ ...             "role": "user",
+ ...             "content": [
+ ...                 {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
+ ...                 {"type": "text", "text": "Write a haiku for this image"},
+ ...             ],
+ ...         },
+ ...     ],
+ ...     [
+ ...         {
+ ...             "role": "user",
+ ...             "content": [
+ ...                 {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
+ ...                 {"type": "text", "text": "Describe this image"},
+ ...             ],
+ ...         },
+ ...     ],
+ ... ]
+
+ >>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
+
+ >>> output = model.generate(**inputs, max_new_tokens=25)
+
+ >>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
+ >>> decoded_outputs
+ ["user\n\nWrite a haiku for this image\nassistant\nSilky lake, \nWooden pier, \nNature's peace.",
+  'user\n\nDescribe this image\nassistant\nThe image shows a street scene with a traditional Chinese archway, known as a "Chinese Gate" or "Chinese Gate of']
+ ```
+
+ ### Batched multi-image input

+ This implementation of the InternVL models supports batched text-image inputs with a different number of images for each prompt.

+ ```python
+ >>> from transformers import AutoProcessor, AutoModelForImageTextToText
+ >>> import torch
+
+ >>> torch_device = "cuda"
+ >>> model_checkpoint = "OpenGVLab/InternVL3-2B-hf"
+ >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
+ >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
+
+ >>> messages = [
+ ...     [
+ ...         {
+ ...             "role": "user",
+ ...             "content": [
+ ...                 {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
+ ...                 {"type": "text", "text": "Write a haiku for this image"},
+ ...             ],
+ ...         },
+ ...     ],
+ ...     [
+ ...         {
+ ...             "role": "user",
+ ...             "content": [
+ ...                 {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
+ ...                 {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
+ ...                 {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
+ ...             ],
+ ...         },
+ ...     ],
+ ... ]
+
+ >>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
+
+ >>> output = model.generate(**inputs, max_new_tokens=25)
+
+ >>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
+ >>> decoded_outputs
+ ["user\n\nWrite a haiku for this image\nassistant\nSilky lake, \nWooden pier, \nNature's peace.",
+  'user\n\n\nThese images depict two different landmarks. Can you identify them?\nassistant\nYes, these images depict the Statue of Liberty and the Golden Gate Bridge.']
+ ```
+
+ ### Video input

+ InternVL models can also handle video inputs. Here is an example of how to perform inference on a video input using chat templates.

+ ```python
+ >>> from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
+ >>> import torch
+
+ >>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
+ >>> quantization_config = BitsAndBytesConfig(load_in_4bit=True)
+ >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
+ >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, quantization_config=quantization_config)
+
+ >>> messages = [
+ ...     {
+ ...         "role": "user",
+ ...         "content": [
+ ...             {
+ ...                 "type": "video",
+ ...                 "url": "https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4",
+ ...             },
+ ...             {"type": "text", "text": "What type of shot is the man performing?"},
+ ...         ],
+ ...     }
+ ... ]
+
+ >>> inputs = processor.apply_chat_template(
+ ...     messages,
+ ...     return_tensors="pt",
+ ...     add_generation_prompt=True,
+ ...     tokenize=True,
+ ...     return_dict=True,
+ ... ).to(model.device, dtype=torch.float16)
+
+ >>> output = model.generate(**inputs, max_new_tokens=25)
+
+ >>> decoded_output = processor.decode(output[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
+ >>> decoded_output
+ 'The man is performing a forehand shot.'
+ ```
+
+ ### Interleaved image and video inputs

+ This example showcases how to handle a batch of chat conversations with interleaved image and video inputs using the chat template.

+ ```python
+ >>> from transformers import AutoProcessor, AutoModelForImageTextToText
+ >>> import torch
+
+ >>> torch_device = "cuda"
+ >>> model_checkpoint = "OpenGVLab/InternVL3-2B-hf"
+ >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
+ >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
+
+ >>> messages = [
+ ...     [
+ ...         {
+ ...             "role": "user",
+ ...             "content": [
+ ...                 {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
+ ...                 {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
+ ...                 {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
+ ...             ],
+ ...         },
+ ...     ],
+ ...     [
+ ...         {
+ ...             "role": "user",
+ ...             "content": [
+ ...                 {"type": "video", "url": "https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4"},
+ ...                 {"type": "text", "text": "What type of shot is the man performing?"},
+ ...             ],
+ ...         },
+ ...     ],
+ ...     [
+ ...         {
+ ...             "role": "user",
+ ...             "content": [
+ ...                 {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
+ ...                 {"type": "text", "text": "Write a haiku for this image"},
+ ...             ],
+ ...         },
+ ...     ],
+ ... ]
+
+ >>> inputs = processor.apply_chat_template(
+ ...     messages,
+ ...     padding=True,
+ ...     add_generation_prompt=True,
+ ...     tokenize=True,
+ ...     return_dict=True,
+ ...     return_tensors="pt",
+ ... ).to(model.device, dtype=torch.bfloat16)
+
+ >>> outputs = model.generate(**inputs, max_new_tokens=25)
+
+ >>> decoded_outputs = processor.batch_decode(outputs, skip_special_tokens=True)
+ >>> decoded_outputs
+ ['user\n\n\nThese images depict two different landmarks. Can you identify them?\nassistant\nThe images depict the Statue of Liberty and the Golden Gate Bridge.',
+  'user\nFrame1: \nFrame2: \nFrame3: \nFrame4: \nFrame5: \nFrame6: \nFrame7: \nFrame8: \nWhat type of shot is the man performing?\nassistant\nA forehand shot',
+  "user\n\nWrite a haiku for this image\nassistant\nSilky lake, \nWooden pier, \nNature's peace."]
+ ```
+
+ ## License

+ This project is released under the MIT License. It uses the pre-trained Qwen2.5 model as a component, which is licensed under the Qwen License.

+ ## Citation

+ If you find this project useful in your research, please consider citing:

+ ```BibTeX
+ @article{chen2024expanding,
+   title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
+   author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
+   journal={arXiv preprint arXiv:2412.05271},
+   year={2024}
+ }
+ @article{wang2024mpo,
+   title={Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization},
+   author={Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Zhu, Jinguo and Zhu, Xizhou and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
+   journal={arXiv preprint arXiv:2411.10442},
+   year={2024}
+ }
+ @article{chen2024far,
+   title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
+   author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
+   journal={arXiv preprint arXiv:2404.16821},
+   year={2024}
+ }
+ @inproceedings{chen2024internvl,
+   title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
+   author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
+   booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
+   pages={24185--24198},
+   year={2024}
+ }
+ ```