tc-mb committed on
Commit 1264ed1 · verified · 1 Parent(s): 66406d5

Update: README

Files changed (1)
  1. README.md +7 -5
README.md CHANGED
@@ -25,7 +25,7 @@ tags:
 **MiniCPM-V 4.5** is the latest and most capable model in the MiniCPM-V series. The model is built on Qwen3-8B and SigLIP2-400M with a total of 8B parameters. It exhibits a significant performance improvement over previous MiniCPM-V and MiniCPM-o models, and introduces new useful features. Notable features of MiniCPM-V 4.5 include:
 
 - 🔥 **State-of-the-art Vision-Language Capability.**
-MiniCPM-V 4.5 achieves an average score of 77.2 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-latest, Gemini-2.0 Pro, and strong open-source models like Qwen2.5-VL 72B** for vision-language capabilities, making it the most performant MLLM under 30B parameters.
+MiniCPM-V 4.5 achieves an average score of 77.0 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-latest, Gemini-2.0 Pro, and strong open-source models like Qwen2.5-VL 72B** for vision-language capabilities, making it the most performant MLLM under 30B parameters.
 
 - 🎬 **Efficient High Refresh Rate and Long Video Understanding.** Powered by a new unified 3D-Resampler over images and videos, MiniCPM-V 4.5 can now achieve a 96x compression rate for video tokens: six 448x448 video frames can be jointly compressed into 64 video tokens (normally 1,536 tokens for most MLLMs). This means the model can perceive significantly more video frames without increasing LLM inference cost, bringing state-of-the-art high-refresh-rate (up to 10FPS) and long video understanding on benchmarks such as Video-MME, LVBench, MLVU, MotionBench, and FavorBench.
 
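The token arithmetic behind the 96x claim can be sanity-checked in a few lines of Python. This is an illustrative sketch, not code from the repository: the 14-pixel patch size (typical for SigLIP-style encoders) and the 256-token-per-frame baseline for "most MLLMs" are assumptions chosen to reproduce the quoted figures.

```python
# Illustrative sketch of the 3D-Resampler token arithmetic (assumed numbers).
patch_size = 14                 # assumed SigLIP-style patch size
frames, side = 6, 448           # six 448x448 frames compressed jointly
raw_tokens = frames * (side // patch_size) ** 2  # 6 * 32 * 32 = 6144 patch tokens
packed_tokens = 64              # one frame pack -> 64 video tokens

print(raw_tokens // packed_tokens)  # 96   -> the "96x compression rate"
print(frames * 256)                 # 1536 -> assuming a typical 256-token
                                    #         per-frame budget in other MLLMs
```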
@@ -45,7 +45,7 @@ MiniCPM-V 4.5 can be easily used in various ways: (1) [llama.cpp](https://github
 <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/radar_minicpm_v45.png", width=60%>
 </div>
 <div align="center">
-<img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv_4_5_evaluation_results.jpg" , width=100%>
+<img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv_4_5_evaluation_result.png" , width=100%>
 </div>
 
 ### Examples
@@ -75,6 +75,9 @@ We deploy MiniCPM-V 4.5 on iPad M4 with [iOS demo](https://github.com/tc-mb/Mini
 
 ## Usage
 
+If you wish to enable thinking mode, provide the argument `enable_thinking=True` to the chat function.
+
+#### Chat with Image
 ```python
 import torch
 from PIL import Image
@@ -89,7 +92,7 @@ tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_
 
 image = Image.open('./assets/minicpmo2_6/show_demo.jpg').convert('RGB')
 
-enable_thinking=False # If `enable_thinking=True`, the long-thinking mode is enabled.
+enable_thinking=False # If `enable_thinking=True`, the thinking mode is enabled.
 
 # First round chat
 question = "What is the landform in the picture?"
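As a usage note, the `enable_thinking` variable above is meant to be passed through to the chat call. The following is a minimal sketch of that pattern; the exact `model.chat` keyword and `msgs` layout are assumed from the MiniCPM-V README convention, since this diff only shows the flag's definition.

```python
# Minimal sketch (assumed call shape): thread the flag into model.chat.
enable_thinking = True  # enable the thinking mode for this turn

msgs = [{'role': 'user', 'content': [image, "What is the landform in the picture?"]}]
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    enable_thinking=enable_thinking,  # assumed keyword, mirroring the flag above
)
print(answer)
```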
@@ -134,7 +137,6 @@ By following these guidelines, you'll have a safe and enjoyable trip while appre
 
 
 #### Chat with Video
-<summary> Click to view Python code running MiniCPM-V-4_5 by with video input and 3D-Resampler. </summary>
 
 ```python
 ## The 3d-resampler compresses multiple frames into 64 tokens by introducing temporal_ids.
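The `temporal_ids` idea mentioned in that comment can be sketched as follows: sampled frames are tagged with their second-of-video and grouped into packs that the 3D-Resampler compresses jointly into 64 tokens. This is an illustrative reconstruction rather than the repository's helper; the pack size of 6 follows the "6 frames into 64 tokens" claim above, and `make_temporal_ids` is a hypothetical name.

```python
# Hypothetical helper illustrating frame packing behind temporal_ids.
def make_temporal_ids(num_frames: int, fps: int, pack_size: int = 6):
    # Tag each sampled frame with its second-of-video, then group frames
    # into packs; each pack is jointly compressed into 64 video tokens.
    seconds = [i // fps for i in range(num_frames)]
    return [seconds[i:i + pack_size] for i in range(0, num_frames, pack_size)]

print(make_temporal_ids(num_frames=12, fps=2))
# [[0, 0, 1, 1, 2, 2], [3, 3, 4, 4, 5, 5]]
```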
@@ -251,7 +253,7 @@ print(answer)
 
 👍 Welcome to explore key techniques of MiniCPM-V 4.5 and other multimodal projects of our team:
 
-[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V)
+[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLPR](https://github.com/OpenBMB/RLPR) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V)
 
 ## Citation
 