yuanzu committed · Commit ce34483 · verified · 1 parent: 277034a

Update README.md


add more description for int8 weight

Files changed (1): README.md (+31, -5)
README.md CHANGED
@@ -2,6 +2,37 @@
  license: mit
  library_name: transformers
  ---
+
+ # Block-wise INT8 DeepSeek-R1
+
+ The INT8 data type is widely supported and efficient on most hardware platforms.
+
+ **We provide block-wise INT8 weights for DeepSeek-R1.**
+
+ In benchmarking, we observe **no accuracy loss** and up to a **30%** performance gain.
+
+ [SGLang](https://github.com/sgl-project/sglang/tree/main) will support block-wise INT8 quantization once our [PULL REQUEST](https://github.com/sgl-project/sglang/pull/3730) is merged.
+
+ ## 1. Benchmarking Results (detailed in the [PULL REQUEST](https://github.com/sgl-project/sglang/pull/3730))
+ | Model | Config | Accuracy (GSM8K) | Accuracy (MMLU) | Output Throughput (qps=128) | Output Throughput (bs=1) |
+ |---------|--------------|------------------|-----------------|-----------------------------|--------------------------|
+ | BF16 R1 | (A100\*16)x2 | 95.8 | 87.1 | 3342.29 | 37.20 |
+ | INT8 R1 | A100\*32 | 95.5 | 87.1 | 4450.02 (+33%) | 44.18 (+18%) |
+
+ ## 2. Quantization Process
+
+ We apply INT8 quantization to the BF16 checkpoints.
+
+ The weight scales are determined by dividing the block-wise maximum of the element values by the INT8 type maximum.
+
+ To generate these weights, run the provided script in the `./inference` directory:
+
+ ```bash
+ python3 bf16_case_int8.py --input-bf16-hf-path /path/to/bf16-weights/ --output-int8-hf-path /path/to/save-int8-weight/
+ ```
+
+ ---
+
  # DeepSeek-R1
  <!-- markdownlint-disable first-line-h1 -->
  <!-- markdownlint-disable html -->
@@ -46,11 +77,6 @@ library_name: transformers
  <a href="https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf"><b>Paper Link</b>👁️</a>
  </p>

- ## 0. INT8 Quantization
-
- We apply a INT8 quantization on the BF16 checkpoints, where weight scales are determined by dividing he block-wise maximum of element values by the INT8 type maximum.
- The quantization script is provided in inference/bf16_case_int8.py.
-
  ## 1. Introduction

  We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.
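The quantization rule added in the "2. Quantization Process" section above (per-block scale = block-wise maximum of the element values divided by the INT8 maximum, then round to INT8) can be sketched in a few lines of PyTorch. This is only an illustration, not the provided `bf16_case_int8.py` script: the 128x128 block size, the function names, and the toy tensor are assumptions made for the example.

```python
import torch

BLOCK = 128           # assumed 128x128 weight blocks; the real script may differ
INT8_MAX = 127.0      # maximum value of the INT8 type


def quantize_blockwise_int8(weight: torch.Tensor):
    """Quantize a 2-D weight into INT8 values plus one FP32 scale per block."""
    rows, cols = weight.shape
    n_row_blocks = (rows + BLOCK - 1) // BLOCK
    n_col_blocks = (cols + BLOCK - 1) // BLOCK

    q = torch.empty(rows, cols, dtype=torch.int8)
    scales = torch.empty(n_row_blocks, n_col_blocks, dtype=torch.float32)

    for i in range(n_row_blocks):
        for j in range(n_col_blocks):
            rs, cs = i * BLOCK, j * BLOCK
            block = weight[rs:rs + BLOCK, cs:cs + BLOCK].float()
            # Scale = block-wise maximum of |elements| divided by the INT8 maximum.
            scale = block.abs().max().clamp(min=1e-12) / INT8_MAX
            scales[i, j] = scale
            q[rs:rs + BLOCK, cs:cs + BLOCK] = (
                (block / scale).round().clamp(-INT8_MAX, INT8_MAX).to(torch.int8)
            )
    return q, scales


def dequantize_blockwise_int8(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    """Reverse the mapping above, for a quick reconstruction-error check."""
    out = q.float()
    for i in range(scales.shape[0]):
        for j in range(scales.shape[1]):
            rs, cs = i * BLOCK, j * BLOCK
            out[rs:rs + BLOCK, cs:cs + BLOCK] *= scales[i, j]
    return out


if __name__ == "__main__":
    # Toy stand-in for one BF16 weight matrix from the checkpoint.
    w = torch.randn(512, 1024).to(torch.bfloat16)
    q, s = quantize_blockwise_int8(w)
    err = (dequantize_blockwise_int8(q, s) - w.float()).abs().max().item()
    print(f"max abs reconstruction error: {err:.5f}")
```

Keeping one scale per block rather than per tensor confines the effect of outlier values to a single block, which is the usual motivation for block-wise over per-tensor INT8 quantization.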