Update README.md #2
opened by yuanzu

README.md CHANGED
@@ -46,11 +77,6 @@ library_name: transformers
<a href="https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf"><b>Paper Link</b>👁️</a>
</p>

-## 0. INT8 Quantization
-
-We apply INT8 quantization to the BF16 checkpoints, where the weight scales are determined by dividing the block-wise maximum of the element values by the INT8 type maximum.
-The quantization script is provided in inference/bf16_case_int8.py.
-
## 1. Introduction

We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.
@@ -2,6 +2,37 @@
license: mit
library_name: transformers
---
+
+# Block-wise INT8 DeepSeek-R1
+
+The INT8 data type is hardware-friendly and efficient on most platforms.
+
+**We provide block-wise INT8 weights for DeepSeek-R1.**
+
+In benchmarking, we observe **no accuracy loss** and up to **30\%** higher output throughput.
+
+[SGLang](https://github.com/sgl-project/sglang/tree/main) will support the block-wise INT8 quantization operation once our [PULL REQUEST](https://github.com/sgl-project/sglang/pull/3730) is merged.
+
+## 1. Benchmarking Results (detailed in the [PULL REQUEST](https://github.com/sgl-project/sglang/pull/3730))
+
+| Model   | Config       | Accuracy (GSM8K) | Accuracy (MMLU) | Output Throughput (qps=128) | Output Throughput (bs=1) |
+|---------|--------------|------------------|-----------------|-----------------------------|--------------------------|
+| BF16 R1 | (A100\*16)x2 | 95.8             | 87.1            | 4450.02 (+33%)              | 44.18 (+18%)             |
+| INT8 R1 | A100\*32     | 95.5             | 87.1            | 3342.29                     | 37.20                    |
+
+## 2. Quantization Process
+
+We apply INT8 quantization to the BF16 checkpoints.
+
+The weight scales are determined by dividing the block-wise maximum of the element values by the INT8 type maximum.
+
+To generate the INT8 weights, run the provided script in the ``./inference`` directory:
+
+```bash
+python3 bf16_case_int8.py --input-bf16-hf-path /path/to/bf16-weights/ --output-int8-hf-path /path/to/save-int8-weight/
+```
+
+---
+
# DeepSeek-R1
<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable html -->
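
For reference, the scale rule described in the added section (divide the block-wise maximum of the element magnitudes by the INT8 maximum) can be sketched in a few lines. The snippet below is a minimal illustration, not the shipped `bf16_case_int8.py` script; the 128x128 block size, the symmetric clamp to [-127, 127], and the use of PyTorch are assumptions made for the example.

```python
# Minimal, illustrative sketch of block-wise INT8 weight quantization.
# Assumptions (not taken from the model card): PyTorch, a 2-D weight whose
# dimensions are multiples of the block size, a 128x128 block, and a
# symmetric clamp to [-127, 127].
import torch

BLOCK = 128        # assumed block size along both weight dimensions
INT8_MAX = 127.0   # largest magnitude representable in signed INT8


def quantize_blockwise_int8(weight: torch.Tensor):
    """Return (int8 weight, per-block float32 scales) for a 2-D weight."""
    rows, cols = weight.shape
    q = torch.empty(rows, cols, dtype=torch.int8)
    scales = torch.empty(rows // BLOCK, cols // BLOCK, dtype=torch.float32)

    for i in range(rows // BLOCK):
        for j in range(cols // BLOCK):
            rs, cs = i * BLOCK, j * BLOCK
            block = weight[rs:rs + BLOCK, cs:cs + BLOCK].float()
            # Scale = block-wise maximum of the element magnitudes / INT8 maximum.
            scale = (block.abs().max() / INT8_MAX).clamp(min=1e-12)
            scales[i, j] = scale
            q[rs:rs + BLOCK, cs:cs + BLOCK] = (
                (block / scale).round().clamp(-INT8_MAX, INT8_MAX).to(torch.int8)
            )
    return q, scales


if __name__ == "__main__":
    w = torch.randn(256, 384, dtype=torch.bfloat16)   # toy stand-in for a BF16 weight
    q, s = quantize_blockwise_int8(w)
    # Dequantize (multiply each block back by its scale) to inspect the error.
    deq = q.float() * s.repeat_interleave(BLOCK, dim=0).repeat_interleave(BLOCK, dim=1)
    print("max abs round-trip error:", (w.float() - deq).abs().max().item())
```

Dequantization is simply the inverse, multiplying each INT8 block back by its stored scale, which is the rescaling an inference kernel would apply when consuming these weights; the round-trip check above gives a rough feel for the quantization error.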