Update: README
README.md
CHANGED
@@ -25,7 +25,7 @@ tags:
 **MiniCPM-V 4.5** is the latest and most capable model in the MiniCPM-V series. The model is built on Qwen3-8B and SigLIP2-400M with a total of 8B parameters. It exhibits a significant performance improvement over previous MiniCPM-V and MiniCPM-o models, and introduces new useful features. Notable features of MiniCPM-V 4.5 include:

 - 🔥 **State-of-the-art Vision-Language Capability.**
-MiniCPM-V 4.5 achieves an average score of 77.
+MiniCPM-V 4.5 achieves an average score of 77.0 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-latest, Gemini-2.0 Pro, and strong open-source models like Qwen2.5-VL 72B** for vision-language capabilities, making it the most performant MLLM under 30B parameters.

 - 🎬 **Efficient High Refresh Rate and Long Video Understanding.** Powered by a new unified 3D-Resampler over images and videos, MiniCPM-V 4.5 achieves a 96x compression rate for video tokens, where six 448x448 video frames are jointly compressed into 64 video tokens (normally 1,536 tokens for most MLLMs). This means the model can perceive significantly more video frames without increasing the LLM inference cost, enabling state-of-the-art high-refresh-rate (up to 10FPS) and long video understanding on Video-MME, LVBench, MLVU, MotionBench, FavorBench, etc.

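To make the compression arithmetic in the hunk above concrete, here is a quick back-of-the-envelope sketch. It is plain arithmetic, not the 3D-Resampler implementation; the 256-token-per-frame baseline is derived from the "normally 1,536 tokens" figure (1,536 / 6 frames), and `video_tokens` is a hypothetical helper name.

```python
# Back-of-the-envelope check of the video token budget described above.
# Plain arithmetic only -- this is NOT the 3D-Resampler implementation.

FRAMES_PER_GROUP = 6      # six 448x448 frames are jointly compressed...
TOKENS_PER_GROUP = 64     # ...into 64 video tokens
BASELINE_PER_FRAME = 1536 // FRAMES_PER_GROUP  # 256 tokens/frame baseline

def video_tokens(num_frames: int) -> int:
    """Tokens the LLM sees for num_frames frames (hypothetical helper)."""
    groups = -(-num_frames // FRAMES_PER_GROUP)  # ceiling division
    return groups * TOKENS_PER_GROUP

frames = 10 * 60  # one minute of video at the 10FPS refresh rate quoted above
print(video_tokens(frames))         # 6400 tokens with 3D resampling
print(frames * BASELINE_PER_FRAME)  # 153600 tokens at the 1,536-token baseline
# i.e. 24x fewer tokens than that baseline; the quoted 96x is measured
# against raw vision tokens rather than this per-frame baseline.
```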
@@ -45,7 +45,7 @@ MiniCPM-V 4.5 can be easily used in various ways: (1) [llama.cpp](https://github
 <img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/radar_minicpm_v45.png", width=60%>
 </div>
 <div align="center">
-<img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/
+<img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv_4_5_evaluation_result.png" , width=100%>
 </div>

 ### Examples
@@ -75,6 +75,9 @@ We deploy MiniCPM-V 4.5 on iPad M4 with [iOS demo](https://github.com/tc-mb/Mini

 ## Usage

+If you wish to enable thinking mode, provide the argument `enable_thinking=True` to the chat function.
+
+#### Chat with Image
 ```python
 import torch
 from PIL import Image
@@ -89,7 +92,7 @@ tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_

 image = Image.open('./assets/minicpmo2_6/show_demo.jpg').convert('RGB')

-enable_thinking=False # If `enable_thinking=True`, the
+enable_thinking=False # If `enable_thinking=True`, the thinking mode is enabled.

 # First round chat
 question = "What is the landform in the picture?"
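For readers following the `enable_thinking` change across the two hunks above, here is a minimal end-to-end sketch assembled from the fragments visible in this diff (model id, tokenizer call, image path, question). The `msgs` structure and the exact `model.chat` signature are assumptions carried over from earlier MiniCPM-V releases; the full README is authoritative.

```python
# Minimal sketch of the "Chat with Image" flow this diff is editing.
# The msgs format and model.chat signature are assumed from prior
# MiniCPM-V releases; consult the full README for the canonical version.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-4_5', trust_remote_code=True,
                                  torch_dtype=torch.bfloat16).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-4_5',
                                          trust_remote_code=True)

image = Image.open('./assets/minicpmo2_6/show_demo.jpg').convert('RGB')

enable_thinking = False  # set True to switch on thinking mode, per the Usage note

# First round chat
question = "What is the landform in the picture?"
msgs = [{'role': 'user', 'content': [image, question]}]

answer = model.chat(msgs=msgs, tokenizer=tokenizer, enable_thinking=enable_thinking)
print(answer)
```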
@@ -134,7 +137,6 @@ By following these guidelines, you'll have a safe and enjoyable trip while appre


 #### Chat with Video
-<summary> Click to view Python code running MiniCPM-V-4_5 by with video input and 3D-Resampler. </summary>

 ```python
 ## The 3d-resampler compresses multiple frames into 64 tokens by introducing temporal_ids.
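The comment at the top of that code block suggests sampled frames are tagged with temporal ids so the 3D-Resampler knows which frames to compress together. Below is a hypothetical illustration of such a grouping step; every name here is invented for illustration, and the real preprocessing lives in the repository's video example.

```python
# Hypothetical illustration of temporal-id grouping for the 3D-Resampler:
# frames sharing a group are jointly compressed into one 64-token packet.
# All names are invented; see the repository's video example for the
# actual preprocessing.

def assign_temporal_ids(num_frames: int, group_size: int = 6) -> list[list[int]]:
    """Split sampled frame indices into consecutive groups of group_size."""
    return [list(range(start, min(start + group_size, num_frames)))
            for start in range(0, num_frames, group_size)]

groups = assign_temporal_ids(16)
print(groups)            # [[0..5], [6..11], [12..15]]
print(len(groups) * 64)  # 192 video tokens for 16 frames
```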
@@ -251,7 +253,7 @@ print(answer)

 👏 Welcome to explore key techniques of MiniCPM-V 4.5 and other multimodal projects of our team:

-[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V)
+[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLPR](https://github.com/OpenBMB/RLPR) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V)

 ## Citation
