---
license: mit
base_model:
- deepseek-ai/DeepSeek-R1-0528
---
# Model Overview
- **Model Architecture:** DeepSeek-R1-0528
- **Input:** Text
- **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 7.0
- **PyTorch:** 2.8.0
- **Transformers:** 4.53.0
- **Operating System(s):** Linux
- **Inference Engine:** [SGLang](https://docs.sglang.ai/)/[vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (V0.10)
- **Weight quantization:** OCP MXFP4, Static
- **Activation quantization:** OCP MXFP4, Dynamic
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model was built from the deepseek-ai DeepSeek-R1-0528 model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for MXFP4 quantization.
# Model Quantization
The model was quantized from [deepseek-ai/DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). Both weights and activations were quantized to MXFP4 format, and the AutoSmoothQuant algorithm was applied to enhance accuracy.
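For background, AutoSmoothQuant builds on the SmoothQuant idea of migrating activation outliers into the weights via a per-channel scale before quantization. A minimal sketch of that transform (the per-layer scale selection is what AutoSmoothQuant automates; see the Quark documentation for the actual procedure):

$$\hat{X} = X\,\operatorname{diag}(s)^{-1}, \qquad \hat{W} = \operatorname{diag}(s)\,W, \qquad s_j = \frac{\max\left(|X_{:,j}|\right)^{\alpha}}{\max\left(|W_{j,:}|\right)^{1-\alpha}}$$

so that $\hat{X}\hat{W} = XW$ exactly, while outlier activation channels are scaled down and absorbed into the corresponding weight rows, making both factors easier to quantize.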
**Preprocessing requirement:**
Before executing the quantization script below, the original FP8 model must first be dequantized to BFloat16.
You can either perform the dequantization manually using this [conversion script](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py), or use the pre-converted BFloat16 model available at [unsloth/DeepSeek-R1-0528-BF16](https://huggingface.co/unsloth/DeepSeek-R1-0528-BF16).
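For the manual route, the conversion script can be invoked along these lines (the paths are placeholders; check the script's `--help` for the exact flags):
```
# Dequantize the original FP8 checkpoint to BFloat16 (paths are placeholders)
python fp8_cast_bf16.py \
    --input-fp8-hf-path /path/to/DeepSeek-R1-0528 \
    --output-bf16-hf-path /path/to/DeepSeek-R1-0528-BF16
```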
**Quantization scripts:**
```
# Run from the Quark examples directory
cd Quark/examples/torch/language_modeling/llm_ptq/
# Keep attention layers, the MoE router gate, and the output head unquantized
exclude_layers="*self_attn* *mlp.gate.* *lm_head"
python3 quantize_quark.py --model_dir $MODEL_DIR \
--quant_scheme w_mxfp4_a_mxfp4 \
--num_calib_data 128 \
--exclude_layers $exclude_layers \
--skip_evaluation \
--multi_gpu \
--quant_algo autosmoothquant \
--model_export hf_format \
--output_dir amd/DeepSeek-R1-0528-MXFP4-ASQ
```
# Deployment
This model can be deployed efficiently using the [SGLang](https://docs.sglang.ai/) and [vLLM](https://docs.vllm.ai/en/latest/) backends.
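For example, a vLLM server can be launched along these lines (a minimal sketch; the tensor-parallel degree matches the 8-GPU setup used in the evaluation below, and the exact flags for MXFP4 support may vary by vLLM build):
```
# Serve the quantized model on 8 GPUs (sketch; verify flags against your vLLM version)
vllm serve amd/DeepSeek-R1-0528-MXFP4-ASQ \
    --tensor-parallel-size 8
```
An equivalent SGLang launch command is shown in the Reproduction section below.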
## Evaluation
The model was evaluated on the AIME24, GPQA Diamond, MATH-500, and GSM8K benchmarks. AIME24, GPQA Diamond, and MATH-500 were run with [lighteval](https://github.com/huggingface/lighteval/tree/v0.10.0) over 10 rounds with different generation seeds; GSM8K was run with [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness).
### Accuracy
| Benchmark | DeepSeek-R1-0528 | DeepSeek-R1-0528-MXFP4-ASQ (this model) | Recovery |
|---|---|---|---|
| AIME24 | 88.00 | 87.67 | 99.62% |
| GPQA Diamond | 79.90 | 79.65 | 99.69% |
| MATH-500 | 97.06 | 96.90 | 99.84% |
| GSM8K | 95.30 | 95.18 | 99.87% |
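Recovery is the quantized model's score expressed as a fraction of the baseline score; for AIME24, for example, 87.67 / 88.00 ≈ 99.62%.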
### Reproduction
The AIME24, MATH-500, and GPQA Diamond results were obtained using a forked [lighteval](https://github.com/zhaolin-amd/lighteval/tree/v0.10-release-custom) and the vLLM Docker image (emulated QDQ) `rocm/vllm-private:pytorch-vllm-gfx950-mxfp4-mxfp6-v3`.
```
# Set docker env
export VLLM_QUARK_F4F6_OFFLINE_DEQUANT_TMPENVVAR=1
# Set output paths (MODEL_ARGS is built inside the loop so each round picks up its own $SEED)
OUTPUT_DIR="results/DeepSeek-R1-0528-MXFP4-ASQ-Seed"
LOG="logs/deepseek_0528_mxfp4.log"
# Evaluate 10 rounds with different generation seeds
for i in $(seq 1 10); do
  # seed in [0, 2**30 - 1]
  SEED=$(shuf -i 0-1073741823 -n 1)
  MODEL_ARGS="model_name=amd/DeepSeek-R1-0528-MXFP4-ASQ,dtype=bfloat16,tensor_parallel_size=8,max_model_length=71536,max_num_batched_tokens=32768,gpu_memory_utilization=0.85,generation_parameters={max_new_tokens:65536,temperature:0.6,top_p:0.95,seed:$SEED}"
  lighteval vllm $MODEL_ARGS "custom|aime24_single|0|0,custom|math_500_single|0|0,custom|gpqa:diamond_single|0|0" \
    --use-chat-template \
    --output-dir "$OUTPUT_DIR/seed_$SEED" \
    2>&1 | tee -a "$LOG"
done
```
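The AIME24, MATH-500, and GPQA Diamond accuracies reported above are aggregated over these 10 seeded rounds.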
The GSM8K result was obtained using [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness) and the SGLang [docker image](https://hub.docker.com/layers/lmsysorg/sglang/v0.5.3.post3-rocm700-mi35x-srt/images/sha256-8c7281fcd4adc7942c7e674d464fee322d1775d7b546596ab4cc7edd258517fc).
```
# Launching server
SGLANG_USE_AITER=1 python -m sglang.launch_server \
--model-path $MODEL_DIR \
--tp 8 \
--port 8000 \
--attention-backend aiter
# Evaluate (set SEED before running, e.g. SEED=$(shuf -i 0-1073741823 -n 1))
MODEL_ARGS="model=amd/DeepSeek-R1-0528-MXFP4-ASQ,base_url=http://localhost:8000/v1/completions,num_concurrent=999999,timeout=999999,tokenized_requests=False,max_length=38768,temperature=0.6,top_p=0.95,add_bos_token=True,seed=$SEED"
lm_eval \
--model local-completions \
--model_args $MODEL_ARGS \
--tasks gsm8k \
--num_fewshot 8 \
--batch_size auto
```
# License
Modifications Copyright(c) 2025 Advanced Micro Devices, Inc. All rights reserved. |