---
license: mit
base_model:
- deepseek-ai/DeepSeek-R1-0528
---
# Model Overview
- **Model Architecture:** DeepSeek-R1-0528
- **Input:** Text
- **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 7.0
- **PyTorch:** 2.8.0
- **Transformers:** 4.53.0
- **Operating System(s):** Linux
- **Inference Engine:** [SGLang](https://docs.sglang.ai/)/[vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (V0.10)
- **Weight quantization:** OCP MXFP4, Static
- **Activation quantization:** OCP MXFP4, Dynamic
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model was built from the deepseek-ai DeepSeek-R1-0528 model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for MXFP4 quantization.
# Model Quantization
The model was quantized from [deepseek-ai/DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). Both weights and activations were quantized to MXFP4 format, and the AutoSmoothQuant algorithm was applied to enhance accuracy.
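For background, AutoSmoothQuant builds on the SmoothQuant idea of migrating activation outliers into the weights via a per-channel scale before quantization. A minimal sketch of that transform (the per-layer scale selection is what AutoSmoothQuant automates; see the Quark documentation for the actual procedure):

$$\hat{X} = X\,\operatorname{diag}(s)^{-1}, \qquad \hat{W} = \operatorname{diag}(s)\,W, \qquad s_j = \frac{\max\left(|X_{:,j}|\right)^{\alpha}}{\max\left(|W_{j,:}|\right)^{1-\alpha}}$$

so that $\hat{X}\hat{W} = XW$ exactly, while outlier activation channels are scaled down and absorbed into the corresponding weight rows, making both factors easier to quantize.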
**Preprocessing requirement:**
Before executing the quantization script below, the original FP8 model must first be dequantized to BFloat16.
You can either perform the dequantization manually using this [conversion script](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py), or use the pre-converted BFloat16 model available at [unsloth/DeepSeek-R1-0528-BF16](https://huggingface.co/unsloth/DeepSeek-R1-0528-BF16).
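For the manual route, the conversion script can be invoked along these lines (the paths are placeholders; check the script's `--help` for the exact flags):
```
# Dequantize the original FP8 checkpoint to BFloat16 (paths are placeholders)
python fp8_cast_bf16.py \
    --input-fp8-hf-path /path/to/DeepSeek-R1-0528 \
    --output-bf16-hf-path /path/to/DeepSeek-R1-0528-BF16
```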
**Quantization scripts:**
```
# Run from the Quark examples directory
cd Quark/examples/torch/language_modeling/llm_ptq/
# Keep attention layers, the MoE router gate, and the output head unquantized
exclude_layers="*self_attn* *mlp.gate.* *lm_head"
python3 quantize_quark.py --model_dir $MODEL_DIR \
--quant_scheme w_mxfp4_a_mxfp4 \
--num_calib_data 128 \
--exclude_layers $exclude_layers \
--skip_evaluation \
--multi_gpu \
--quant_algo autosmoothquant \
--model_export hf_format \
--output_dir amd/DeepSeek-R1-0528-MXFP4-ASQ
```
# Deployment
This model can be deployed efficiently using the [SGLang](https://docs.sglang.ai/) and [vLLM](https://docs.vllm.ai/en/latest/) backends.
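For example, a vLLM server can be launched along these lines (a minimal sketch; the tensor-parallel degree matches the 8-GPU setup used in the evaluation below, and the exact flags for MXFP4 support may vary by vLLM build):
```
# Serve the quantized model on 8 GPUs (sketch; verify flags against your vLLM version)
vllm serve amd/DeepSeek-R1-0528-MXFP4-ASQ \
    --tensor-parallel-size 8
```
An equivalent SGLang launch command is shown in the Reproduction section below.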
## Evaluation
The model was evaluated on the AIME24, GPQA Diamond, MATH-500, and GSM8K benchmarks. AIME24, GPQA Diamond, and MATH-500 were run with [lighteval](https://github.com/huggingface/lighteval/tree/v0.10.0) over 10 rounds with different generation seeds; GSM8K was run with [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness).
### Accuracy
| Benchmark | DeepSeek-R1-0528 | DeepSeek-R1-0528-MXFP4-ASQ (this model) | Recovery |
|---|---|---|---|
| AIME24 | 88.00 | 87.67 | 99.62% |
| GPQA Diamond | 79.90 | 79.65 | 99.69% |
| MATH-500 | 97.06 | 96.90 | 99.84% |
| GSM8K | 95.30 | 95.18 | 99.87% |
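Recovery is the quantized model's score expressed as a fraction of the baseline score; for AIME24, for example, 87.67 / 88.00 ≈ 99.62%.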
### Reproduction
The AIME24, MATH-500, and GPQA Diamond results were obtained using a forked [lighteval](https://github.com/zhaolin-amd/lighteval/tree/v0.10-release-custom) and the vLLM Docker image (emulated QDQ) `rocm/vllm-private:pytorch-vllm-gfx950-mxfp4-mxfp6-v3`.
```
# Set docker env
export VLLM_QUARK_F4F6_OFFLINE_DEQUANT_TMPENVVAR=1
# Set output paths (MODEL_ARGS is built inside the loop so each round picks up its own $SEED)
OUTPUT_DIR="results/DeepSeek-R1-0528-MXFP4-ASQ-Seed"
LOG="logs/deepseek_0528_mxfp4.log"
# Evaluate 10 rounds with different generation seeds
for i in $(seq 1 10); do
  # seed in [0, 2**30 - 1]
  SEED=$(shuf -i 0-1073741823 -n 1)
  MODEL_ARGS="model_name=amd/DeepSeek-R1-0528-MXFP4-ASQ,dtype=bfloat16,tensor_parallel_size=8,max_model_length=71536,max_num_batched_tokens=32768,gpu_memory_utilization=0.85,generation_parameters={max_new_tokens:65536,temperature:0.6,top_p:0.95,seed:$SEED}"
  lighteval vllm $MODEL_ARGS "custom|aime24_single|0|0,custom|math_500_single|0|0,custom|gpqa:diamond_single|0|0" \
    --use-chat-template \
    --output-dir "$OUTPUT_DIR/seed_$SEED" \
    2>&1 | tee -a "$LOG"
done
```
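The AIME24, MATH-500, and GPQA Diamond accuracies reported above are aggregated over these 10 seeded rounds.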
The GSM8K result was obtained using [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness) and the SGLang [docker image](https://hub.docker.com/layers/lmsysorg/sglang/v0.5.3.post3-rocm700-mi35x-srt/images/sha256-8c7281fcd4adc7942c7e674d464fee322d1775d7b546596ab4cc7edd258517fc).
```
# Launching server
SGLANG_USE_AITER=1 python -m sglang.launch_server \
--model-path $MODEL_DIR \
--tp 8 \
--port 8000 \
--attention-backend aiter
# Evaluate (set SEED before running, e.g. SEED=$(shuf -i 0-1073741823 -n 1))
MODEL_ARGS="model=amd/DeepSeek-R1-0528-MXFP4-ASQ,base_url=http://localhost:8000/v1/completions,num_concurrent=999999,timeout=999999,tokenized_requests=False,max_length=38768,temperature=0.6,top_p=0.95,add_bos_token=True,seed=$SEED"
lm_eval \
--model local-completions \
--model_args $MODEL_ARGS \
--tasks gsm8k \
--num_fewshot 8 \
--batch_size auto
```
# License
Modifications Copyright(c) 2025 Advanced Micro Devices, Inc. All rights reserved. |