Add link to paper in introduction

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +82 -81
README.md CHANGED
@@ -1,81 +1,82 @@
- ---
- license: apache-2.0
- language:
- - zh
- - en
- pipeline_tag: text-generation
- library_name: transformers
- ---
- <div align="center">
- <img src="https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm_logo.png?raw=true" width="500em" ></img>
- </div>
-
- <p align="center">
- <a href="https://github.com/OpenBMB/MiniCPM/" target="_blank">GitHub Repo</a> |
- <a href="https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf" target="_blank">Technical Report</a>
- </p>
- <p align="center">
- 👋 Join us on <a href="https://discord.gg/3cGQn9b3YM" target="_blank">Discord</a> and <a href="https://github.com/OpenBMB/MiniCPM/blob/main/assets/wechat.jpg" target="_blank">WeChat</a>
- </p>
-
- ## What's New
- - [2025.06.06] The **MiniCPM4** series is released! This model achieves ultimate efficiency improvements while maintaining optimal performance at the same scale! It can achieve over 5x generation acceleration on typical end-side chips! You can find the technical report [here](https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf). 🔥🔥🔥
-
- ## MiniCPM4 Series
- The MiniCPM4 series comprises highly efficient large language models (LLMs) designed explicitly for end-side devices, achieving this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems.
- - [MiniCPM4-8B](https://huggingface.co/openbmb/MiniCPM4-8B): The flagship of MiniCPM4, with 8B parameters, trained on 8T tokens.
- - [MiniCPM4-0.5B](https://huggingface.co/openbmb/MiniCPM4-0.5B): The small version of MiniCPM4, with 0.5B parameters, trained on 1T tokens.
- - [MiniCPM4-8B-Eagle-FRSpec](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-FRSpec): Eagle head for FRSpec, accelerating speculative inference for MiniCPM4-8B.
- - [MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu): Eagle head trained with QAT for FRSpec, efficiently integrating speculation and quantization to achieve ultra acceleration for MiniCPM4-8B. (**<-- you are here**)
- - [MiniCPM4-8B-Eagle-vLLM](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-vLLM): Eagle head in vLLM format, accelerating speculative inference for MiniCPM4-8B.
- - [MiniCPM4-8B-marlin-Eagle-vLLM](https://huggingface.co/openbmb/MiniCPM4-8B-marlin-Eagle-vLLM): Quantized Eagle head in vLLM format, accelerating speculative inference for MiniCPM4-8B.
- - [BitCPM4-0.5B](https://huggingface.co/openbmb/BitCPM4-0.5B): Extreme ternary quantization applied to MiniCPM4-0.5B compresses model parameters into ternary values, achieving a 90% reduction in bit width.
- - [BitCPM4-1B](https://huggingface.co/openbmb/BitCPM4-1B): Extreme ternary quantization applied to MiniCPM3-1B compresses model parameters into ternary values, achieving a 90% reduction in bit width.
- - [MiniCPM4-Survey](https://huggingface.co/openbmb/MiniCPM4-Survey): Based on MiniCPM4-8B, accepts users' queries as input and autonomously generates trustworthy, long-form survey papers.
- - [MiniCPM4-MCP](https://huggingface.co/openbmb/MiniCPM4-MCP): Based on MiniCPM4-8B, accepts users' queries and available MCP tools as input and autonomously calls relevant MCP tools to satisfy users' requirements.
-
- ## Introduction
- MiniCPM4-8B-Eagle-FRSpec-QAT is a quantization-friendly Eagle model trained with MiniCPM4-8B using QAT. It can be used with our inference framework [cpm.cu](https://github.com/OpenBMB/cpm.cu) together with FRSpec, accelerating generation by roughly 7x compared to Qwen3-8B.
-
- ## Usage
- ### Inference with cpm.cu
- ```bash
- # case 1: verify the model with the base model kept in fp16/bf16 (Eagle head quantized)
- cd cpm.cu/tests
- python3 test_generate.py \
-   --no-apply-quant \
-   --apply-eagle-quant
-
- # case 2: verify the model with the base model quantized with Marlin (W4A16, group size = 128)
- cd cpm.cu/tests
- python3 test_generate.py \
-   --apply-quant \
-   --apply-eagle-quant
- ```
-
- ## Evaluation
-
- Tested on two representative edge devices, the Jetson AGX Orin and the RTX 4090, MiniCPM4 with MiniCPM4-8B-Eagle-FRSpec-QAT demonstrates significantly faster processing than models of comparable size on long-text tasks, and its advantage becomes more pronounced as the text length increases. On the Jetson AGX Orin, MiniCPM4 achieves approximately a 7x improvement in generation speed compared to Qwen3-8B.
-
- ![speed test](https://raw.githubusercontent.com/OpenBMB/MiniCPM/refs/heads/minicpm-4/assets/minicpm4/efficiency.png)
-
-
- ## Statement
- - As a language model, MiniCPM generates content by learning from a vast amount of text.
- - However, it does not possess the ability to comprehend or express personal opinions or value judgments.
- - Any content generated by MiniCPM does not represent the viewpoints or positions of the model developers.
- - Therefore, when using content generated by MiniCPM, users should take full responsibility for evaluating and verifying it on their own.
-
- ## LICENSE
- - This repository and MiniCPM models are released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
-
- ## Citation
- - Please cite our [paper](https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf) if you find our work valuable.
-
- ```bibtex
- @article{minicpm4,
-   title={{MiniCPM4}: Ultra-Efficient LLMs on End Devices},
-   author={MiniCPM Team},
-   year={2025}
- }
- ```
 
 
+ ---
+ language:
+ - zh
+ - en
+ library_name: transformers
+ license: apache-2.0
+ pipeline_tag: text-generation
+ ---
+
+ <div align="center">
+ <img src="https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm_logo.png?raw=true" width="500em" ></img>
+ </div>
+
+ <p align="center">
+ <a href="https://github.com/OpenBMB/MiniCPM/" target="_blank">GitHub Repo</a> |
+ <a href="https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf" target="_blank">Technical Report</a>
+ </p>
+ <p align="center">
+ 👋 Join us on <a href="https://discord.gg/3cGQn9b3YM" target="_blank">Discord</a> and <a href="https://github.com/OpenBMB/MiniCPM/blob/main/assets/wechat.jpg" target="_blank">WeChat</a>
+ </p>
+
+ ## What's New
+ - [2025.06.06] The **MiniCPM4** series is released! This model achieves ultimate efficiency improvements while maintaining optimal performance at the same scale! It can achieve over 5x generation acceleration on typical end-side chips! You can find the technical report [here](https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf). 🔥🔥🔥
+
+ ## MiniCPM4 Series
+ The MiniCPM4 series comprises highly efficient large language models (LLMs) designed explicitly for end-side devices, achieving this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems.
+ - [MiniCPM4-8B](https://huggingface.co/openbmb/MiniCPM4-8B): The flagship of MiniCPM4, with 8B parameters, trained on 8T tokens.
+ - [MiniCPM4-0.5B](https://huggingface.co/openbmb/MiniCPM4-0.5B): The small version of MiniCPM4, with 0.5B parameters, trained on 1T tokens.
+ - [MiniCPM4-8B-Eagle-FRSpec](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-FRSpec): Eagle head for FRSpec, accelerating speculative inference for MiniCPM4-8B.
+ - [MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu): Eagle head trained with QAT for FRSpec, efficiently integrating speculation and quantization to achieve ultra acceleration for MiniCPM4-8B. (**<-- you are here**)
+ - [MiniCPM4-8B-Eagle-vLLM](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-vLLM): Eagle head in vLLM format, accelerating speculative inference for MiniCPM4-8B.
+ - [MiniCPM4-8B-marlin-Eagle-vLLM](https://huggingface.co/openbmb/MiniCPM4-8B-marlin-Eagle-vLLM): Quantized Eagle head in vLLM format, accelerating speculative inference for MiniCPM4-8B.
+ - [BitCPM4-0.5B](https://huggingface.co/openbmb/BitCPM4-0.5B): Extreme ternary quantization applied to MiniCPM4-0.5B compresses model parameters into ternary values, achieving a 90% reduction in bit width (see the note after this list).
+ - [BitCPM4-1B](https://huggingface.co/openbmb/BitCPM4-1B): Extreme ternary quantization applied to MiniCPM3-1B compresses model parameters into ternary values, achieving a 90% reduction in bit width.
+ - [MiniCPM4-Survey](https://huggingface.co/openbmb/MiniCPM4-Survey): Based on MiniCPM4-8B, accepts users' queries as input and autonomously generates trustworthy, long-form survey papers.
+ - [MiniCPM4-MCP](https://huggingface.co/openbmb/MiniCPM4-MCP): Based on MiniCPM4-8B, accepts users' queries and available MCP tools as input and autonomously calls relevant MCP tools to satisfy users' requirements.
+
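+ A quick arithmetic note on the "90% reduction in bit width" figure quoted for the BitCPM4 models above, assuming a 16-bit fp16/bf16 baseline (the baseline is our assumption, not a statement from the report):
+
+ $$ \log_2 3 \approx 1.58 \ \text{bits per ternary weight}, \qquad 1 - \frac{1.58}{16} \approx 0.90 $$
+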
+ ## Introduction
+ MiniCPM4-8B-Eagle-FRSpec-QAT is a quantization-friendly Eagle model trained with MiniCPM4-8B using QAT. The model was introduced in [MiniCPM4: Ultra-Efficient LLMs on End Devices](https://huggingface.co/papers/2506.07900). It can be used with our inference framework [cpm.cu](https://github.com/OpenBMB/cpm.cu) together with FRSpec, accelerating generation by roughly 7x compared to Qwen3-8B.
+
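+ To try it out, you first need the cpm.cu framework. The following is a minimal setup sketch; the clone-and-install flow below is an assumption on our part (submodule and build details may differ), so follow the instructions in the [cpm.cu](https://github.com/OpenBMB/cpm.cu) repository if they disagree:
+
+ ```bash
+ # fetch the inference framework (include submodules, if any)
+ git clone --recursive https://github.com/OpenBMB/cpm.cu.git
+ cd cpm.cu
+ # build and install the CUDA kernels and Python bindings
+ pip install .
+ ```
+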
+ ## Usage
+ ### Inference with cpm.cu
+ ```bash
+ # case 1: verify the model with the base model kept in fp16/bf16 (Eagle head quantized)
+ cd cpm.cu/tests
+ python3 test_generate.py \
+   --no-apply-quant \
+   --apply-eagle-quant
+
+ # case 2: verify the model with the base model quantized with Marlin (W4A16, group size = 128)
+ cd cpm.cu/tests
+ python3 test_generate.py \
+   --apply-quant \
+   --apply-eagle-quant
+ ```
+
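+ The test script expects the MiniCPM4-8B base model and this Eagle head to be available. If they are not fetched automatically, you can download them from the Hub first; the commands below are a sketch (whether test_generate.py picks up the cached files or needs explicit paths is an assumption, so check the cpm.cu documentation):
+
+ ```bash
+ # download the base model and the QAT Eagle head into the local Hugging Face cache
+ huggingface-cli download openbmb/MiniCPM4-8B
+ huggingface-cli download openbmb/MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu
+ ```
+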
+ ## Evaluation
+
+ Tested on two representative edge devices, the Jetson AGX Orin and the RTX 4090, MiniCPM4 with MiniCPM4-8B-Eagle-FRSpec-QAT demonstrates significantly faster processing than models of comparable size on long-text tasks, and its advantage becomes more pronounced as the text length increases. On the Jetson AGX Orin, MiniCPM4 achieves approximately a 7x improvement in generation speed compared to Qwen3-8B.
+
+ ![speed test](https://raw.githubusercontent.com/OpenBMB/MiniCPM/refs/heads/minicpm-4/assets/minicpm4/efficiency.png)
+
+
+ ## Statement
+ - As a language model, MiniCPM generates content by learning from a vast amount of text.
+ - However, it does not possess the ability to comprehend or express personal opinions or value judgments.
+ - Any content generated by MiniCPM does not represent the viewpoints or positions of the model developers.
+ - Therefore, when using content generated by MiniCPM, users should take full responsibility for evaluating and verifying it on their own.
+
+ ## LICENSE
+ - This repository and MiniCPM models are released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
+
+ ## Citation
+ - Please cite our [paper](https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf) if you find our work valuable.
+
+ ```bibtex
+ @article{minicpm4,
+   title={{MiniCPM4}: Ultra-Efficient LLMs on End Devices},
+   author={MiniCPM Team},
+   year={2025}
+ }
+ ```