nielsr (HF Staff) committed
Commit 252627e · verified · 1 Parent(s): f2ed63e

Improve model card for InternVL2_5-2B-MPO: Add abstract and enhance tags


Hello team,

I've reviewed the model card for `InternVL2_5-2B-MPO` and identified a few improvements to make it more informative and discoverable:

1. **Added Paper Abstract:** I've included the abstract from the paper "Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling" (arXiv:2412.05271) under a new `## Abstract` section.
2. **Enhanced Tags:** I've added the `multimodal` and `vision-language-model` tags to the metadata for better discoverability; the resulting metadata block is sketched below.
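
For quick reference, the metadata block after this change should look roughly as follows (reconstructed from the diff below; field order as in the updated card):

```yaml
---
base_model:
- OpenGVLab/InternVL2_5-2B
datasets:
- OpenGVLab/MMPR-v1.1
language:
- multilingual
library_name: transformers
license: mit
pipeline_tag: image-text-to-text
tags:
- internvl
- custom_code
- multimodal
- vision-language-model
base_model_relation: finetune
---
```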

Files changed (1)
  1. README.md +73 -33
README.md CHANGED
@@ -1,17 +1,19 @@
  ---
- license: mit
- pipeline_tag: image-text-to-text
- library_name: transformers
  base_model:
- - OpenGVLab/InternVL2_5-2B
- base_model_relation: finetune
+ - OpenGVLab/InternVL2_5-2B
  datasets:
- - OpenGVLab/MMPR-v1.1
+ - OpenGVLab/MMPR-v1.1
  language:
- - multilingual
+ - multilingual
+ library_name: transformers
+ license: mit
+ pipeline_tag: image-text-to-text
  tags:
- - internvl
- - custom_code
+ - internvl
+ - custom_code
+ - multimodal
+ - vision-language-model
+ base_model_relation: finetune
  ---

  # InternVL2_5-2B-MPO
@@ -24,6 +26,9 @@ tags:
  <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64006c09330a45b03605bba3/zJsd2hqd3EevgXo6fNgC-.png">
  </div>

+ ## Abstract
+ We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, multi-image / video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is the first open-source MLLMs to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing strong potential for test-time scaling. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems. HuggingFace demo see this https URL
+
  ## Introduction

  We introduce InternVL2.5-MPO, an advanced multimodal large language model (MLLM) series that demonstrates superior overall performance. This series builds upon InternVL2.5 and Mixed Preference Optimization.
@@ -113,18 +118,18 @@ Additionally, the BCO loss is employed as the quality loss, which helps the mode
  The loss function is defined as:

  $$
- \mathcal{L}_{\text{q}}=\mathcal{L}_{\text{q}}^+ + \mathcal{L}_{\text{q}}^-,
+ \mathcal{L}_{\text{q}}=\mathcal{L}_{\text{q}}^+ + \mathcal{L}_{\text{q}}^-,\tag{3}
  $$

  where \\(\mathcal{L}_{\text{q}}^{+}\\) and \\(\mathcal{L}_{\text{q}}^{-}\\) represent the loss for chosen and rejected responses, respectively.
  Each response type's loss is calculated independently, requiring the model to differentiate the absolute quality of individual responses. The loss terms are given by:

  $$
- \mathcal{L}_{\text{q}}^+=-\log \sigma\left(\beta \log \frac{\pi_\theta\left(y_c \mid x\right)}{\pi_0\left(y_c \mid x\right)} - \delta\right),
+ \mathcal{L}_{\text{q}}^+=-\log \sigma\left(\beta \log \frac{\pi_\theta\left(y_c \mid x\right)}{\pi_0\left(y_c \mid x\right)} - \delta\right),\tag{4}
  $$

  $$
- \mathcal{L}_{\text{q}}^-=-\log \sigma\left(-\left(\beta \log \frac{\pi_\theta\left(y_r \mid x\right)}{\pi_0\left(y_r \mid x\right)} - \delta\right) \right),
+ \mathcal{L}_{\text{q}}^-=-\log \sigma\left(-\left(\beta \log \frac{\pi_\theta\left(y_r \mid x\right)}{\pi_0\left(y_r \mid x\right)} - \delta\right) \right),\tag{5}
  $$

  where \\(\delta\\) represents the reward shift, calculated as the moving average of previous rewards to stabilize training.
@@ -133,7 +138,7 @@ Finally, the SFT loss is used as the generation loss to help the model learn the
  The loss function is defined as:

  $$
- \mathcal{L}_{\text{gen}}=-\frac{\log\pi_\theta\left(y_c \mid x\right)}{\left| y_c \right|}.
+ \mathcal{L}_{\text{gen}}=-\frac{\log\pi_\theta\left(y_c \mid x\right)}{\left| y_c \right|}.\tag{6}
  $$

  ## Evaluation on Multimodal Capability
@@ -344,40 +349,50 @@ generation_config = dict(max_new_tokens=1024, do_sample=True)
  # pure-text conversation (纯文本对话)
  question = 'Hello, who are you?'
  response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
- print(f'User: {question}\nAssistant: {response}')
+ print(f'User: {question}
+ Assistant: {response}')

  question = 'Can you tell me a story?'
  response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
- print(f'User: {question}\nAssistant: {response}')
+ print(f'User: {question}
+ Assistant: {response}')

  # single-image single-round conversation (单图单轮对话)
- question = '<image>\nPlease describe the image shortly.'
+ question = '<image>
+ Please describe the image shortly.'
  response = model.chat(tokenizer, pixel_values, question, generation_config)
- print(f'User: {question}\nAssistant: {response}')
+ print(f'User: {question}
+ Assistant: {response}')

  # single-image multi-round conversation (单图多轮对话)
- question = '<image>\nPlease describe the image in detail.'
+ question = '<image>
+ Please describe the image in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
- print(f'User: {question}\nAssistant: {response}')
+ print(f'User: {question}
+ Assistant: {response}')

  question = 'Please write a poem according to the image.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
- print(f'User: {question}\nAssistant: {response}')
+ print(f'User: {question}
+ Assistant: {response}')

  # multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
  pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

- question = '<image>\nDescribe the two images in detail.'
+ question = '<image>
+ Describe the two images in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 history=None, return_history=True)
- print(f'User: {question}\nAssistant: {response}')
+ print(f'User: {question}
+ Assistant: {response}')

  question = 'What are the similarities and differences between these two images.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 history=history, return_history=True)
- print(f'User: {question}\nAssistant: {response}')
+ print(f'User: {question}
+ Assistant: {response}')

  # multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
@@ -385,17 +400,21 @@ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat1
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

- question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
+ question = 'Image-1: <image>
+ Image-2: <image>
+ Describe the two images in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 num_patches_list=num_patches_list,
                                 history=None, return_history=True)
- print(f'User: {question}\nAssistant: {response}')
+ print(f'User: {question}
+ Assistant: {response}')

  question = 'What are the similarities and differences between these two images.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 num_patches_list=num_patches_list,
                                 history=history, return_history=True)
- print(f'User: {question}\nAssistant: {response}')
+ print(f'User: {question}
+ Assistant: {response}')

  # batch inference, single image per sample (单图批处理)
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
@@ -403,13 +422,15 @@ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat1
  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

- questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
+ questions = ['<image>
+ Describe the image in detail.'] * len(num_patches_list)
  responses = model.batch_chat(tokenizer, pixel_values,
                               num_patches_list=num_patches_list,
                               questions=questions,
                               generation_config=generation_config)
  for question, response in zip(questions, responses):
-     print(f'User: {question}\nAssistant: {response}')
+     print(f'User: {question}
+ Assistant: {response}')

  # video multi-round conversation (视频多轮对话)
  def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
@@ -447,17 +468,24 @@ def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=3
  video_path = './examples/red-panda.mp4'
  pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
  pixel_values = pixel_values.to(torch.bfloat16).cuda()
- video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
+ video_prefix = ''.join([f'Frame{i+1}: <image>
+ ' for i in range(len(num_patches_list))])
  question = video_prefix + 'What is the red panda doing?'
- # Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
+ # Frame1: <image>
+ Frame2: <image>
+ ...
+ Frame8: <image>
+ {question}
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 num_patches_list=num_patches_list, history=None, return_history=True)
- print(f'User: {question}\nAssistant: {response}')
+ print(f'User: {question}
+ Assistant: {response}')

  question = 'Describe this video in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 num_patches_list=num_patches_list, history=history, return_history=True)
- print(f'User: {question}\nAssistant: {response}')
+ print(f'User: {question}
+ Assistant: {response}')
  ```

  #### Streaming Output
@@ -539,7 +567,9 @@ image_urls=[

  images = [load_image(img_url) for img_url in image_urls]
  # Numbering images improves multi-image conversations
- response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
+ response = pipe((f'Image-1: {IMAGE_TOKEN}
+ Image-2: {IMAGE_TOKEN}
+ describe these two images', images))
  print(response.text)
  ```

@@ -659,3 +689,13 @@ If you find this project useful in your research, please consider citing:
  year={2024}
  }
  ```
+
+ ## Acknowledgement
+
+ InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
+
+ ______________________________________________________________________
+
+ Scan the following QR Code, join our WeChat group.
+
+ <p align="center"><img width="300" alt="image" src="https://github.com/user-attachments/assets/f776df09-ebba-4fd5-80c2-fec4ff1518be"></p>