DeepSeek-R1-Distill-Qwen ONNX models

This repository hosts the optimized versions of DeepSeek-R1-Distill-Qwen-1.5B and DeepSeek-R1-Distill-Qwen-7B to accelerate inference with ONNX Runtime. Optimized models are published here in ONNX format to run with ONNX Runtime on CPU and GPU across devices, including server platforms, Windows, Linux and Mac desktops, and mobile CPUs, with the precision best suited to each of these targets.

To get started with these models easily, you can use our ONNX Runtime Generate() API. See the ONNX Runtime GenAI documentation for instructions.

For CPU:

# Download the model directly using the Hugging Face CLI
huggingface-cli download onnxruntime/DeepSeek-R1-Distill-ONNX --include deepseek-r1-distill-qwen-1.5B/cpu_and_mobile/* --local-dir .

# Install the CPU package of ONNX Runtime GenAI
pip install onnxruntime-genai

# Please adjust the model directory (-m) accordingly 
curl -O https://raw.githubusercontent.com/microsoft/onnxruntime-genai/refs/heads/main/examples/python/model-chat.py
python model-chat.py -m /path/to/cpu-int4-rtn-block-32-acc-level-4/ -e cpu --chat_template "<|begin▁of▁sentence|><|User|>{input}<|Assistant|>"
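If you prefer to call the Generate() API from Python rather than run the sample script, the sketch below shows one way to do it with the CPU model. It is a minimal sketch, assuming a recent onnxruntime-genai Python API (0.5+ style, where Generator.append_tokens is available); the model path, prompt, and max_length are placeholders to adjust for your setup.

# Minimal sketch of the Generate() API; assumes a recent onnxruntime-genai (0.5+ style API).
# Adjust model_path to the folder downloaded above.
import onnxruntime_genai as og

model_path = "deepseek-r1-distill-qwen-1.5B/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4"
chat_template = "<|begin▁of▁sentence|><|User|>{input}<|Assistant|>"

model = og.Model(model_path)
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=2048)
generator = og.Generator(model, params)

# Apply the DeepSeek chat template to the raw user prompt before encoding.
prompt = chat_template.format(input="What is 7 * 6?")
generator.append_tokens(tokenizer.encode(prompt))

# Decode tokens one at a time and stream them to stdout.
while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
print()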

For CUDA:

# Download the model directly using the Hugging Face CLI
huggingface-cli download onnxruntime/DeepSeek-R1-Distill-ONNX --include deepseek-r1-distill-qwen-1.5B/gpu/* --local-dir .

# Install the CUDA package of ONNX Runtime GenAI
pip install onnxruntime-genai-cuda

# Please adjust the model directory (-m) accordingly 
curl -O https://raw.githubusercontent.com/microsoft/onnxruntime-genai/refs/heads/main/examples/python/model-chat.py
python model-chat.py -m /path/to/gpu-int4-rtn-block-32/ -e cuda --chat_template "<|begin▁of▁sentence|><|User|>{input}<|Assistant|>"
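The sample script selects the execution provider with the -e flag. If you call the API directly, recent onnxruntime-genai releases also let you override the provider at model load time via og.Config, as the model-chat.py example does; the sketch below assumes those calls are available in your installed version.

# Sketch: choosing the execution provider at load time (og.Config in recent onnxruntime-genai releases).
import onnxruntime_genai as og

model_path = "deepseek-r1-distill-qwen-1.5B/gpu/gpu-int4-rtn-block-32"  # adjust to your download location

config = og.Config(model_path)
config.clear_providers()
config.append_provider("cuda")  # use "dml" for DirectML; leave providers cleared for CPU

model = og.Model(config)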

For DirectML:

# Download the model directly using the Hugging Face CLI
huggingface-cli download onnxruntime/DeepSeek-R1-Distill-ONNX --include deepseek-r1-distill-qwen-1.5B/gpu/* --local-dir .

# Install the DirectML package of ONNX Runtime GenAI
pip install onnxruntime-genai-directml

# Please adjust the model directory (-m) accordingly 
curl -O https://raw.githubusercontent.com/microsoft/onnxruntime-genai/refs/heads/main/examples/python/model-chat.py
python model-chat.py -m /path/to/gpu-int4-rtn-block-32/ -e dml --chat_template "<|begin▁of▁sentence|><|User|>{input}<|Assistant|>"

ONNX Models

Here are some of the optimized configurations we have added:

  1. ONNX model for CPU and mobile using int4 quantization via RTN (round-to-nearest).
  2. ONNX model for GPU using int4 quantization via RTN (round-to-nearest); a simplified sketch of blockwise RTN quantization follows this list.
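RTN quantization maps each block of weights to 4-bit integers with one scale per block and needs no calibration data. The numpy sketch below is an illustrative, simplified version of symmetric blockwise int4 RTN with block size 32 (matching the block-32 folder names); it is not the exact recipe used to produce these models.

# Illustrative sketch of symmetric blockwise int4 RTN quantization (block size 32).
# Simplified for clarity; not the exact algorithm used to build the published models.
import numpy as np

def rtn_int4_quantize(weights: np.ndarray, block_size: int = 32):
    """Quantize a 1-D float array to int4 values with one scale per block."""
    n = len(weights)
    pad = (-n) % block_size
    w = np.pad(weights, (0, pad)).reshape(-1, block_size)

    # One scale per block so the largest magnitude maps to the int4 limit (7).
    scales = np.max(np.abs(w), axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0

    # Round-to-nearest onto the int4 grid [-8, 7].
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def rtn_int4_dequantize(q: np.ndarray, scales: np.ndarray, n: int):
    return (q.astype(np.float32) * scales).reshape(-1)[:n]

w = np.random.randn(100).astype(np.float32)
q, s = rtn_int4_quantize(w)
w_hat = rtn_int4_dequantize(q, s, len(w))
print("max abs error:", np.max(np.abs(w - w_hat)))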

Performance

ONNX enables you to run your models on-device across CPU, GPU, and NPU. With ONNX, you can run your models on any machine across all silicon vendors (Qualcomm, AMD, Intel, NVIDIA, etc.).

See the table below for some key benchmarks for Windows GPU and CPU devices that the ONNX models were tested on.

| Model | Precision | Device Type | Execution Provider | Device | Token Generation Throughput (tokens/s) | Speed up vs base model |
|---|---|---|---|---|---|---|
| deepseek-ai_DeepSeek-R1-Distill-Qwen-1.5B | ONNX fp16 | GPU | CUDA | RTX 4090 | 197.195 | 4X |
| deepseek-ai_DeepSeek-R1-Distill-Qwen-1.5B | ONNX int4 | GPU | CUDA | RTX 4090 | 313.32 | 6.3X |
| deepseek-ai_DeepSeek-R1-Distill-Qwen-1.5B | ONNX int4 | CPU | CPU | Intel i9 | 11.749 | 1.4X |
| deepseek-ai_DeepSeek-R1-Distill-Qwen-7B | ONNX fp16 | GPU | CUDA | RTX 4090 | 57.316 | 1.3X |
| deepseek-ai_DeepSeek-R1-Distill-Qwen-7B | ONNX int4 | GPU | CUDA | RTX 4090 | 161.00 | 3.7X |
| deepseek-ai_DeepSeek-R1-Distill-Qwen-7B | ONNX int4 | CPU | CPU | Intel i9 | 3.184 | 20X |
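A rough way to measure token generation throughput (tokens/s) with the Generate() API is sketched below; it is not the exact harness used to produce the numbers above, and the model path and prompt are placeholders.

# Rough throughput measurement sketch (tokens/s); not the benchmark harness used for the table above.
import time
import onnxruntime_genai as og

model = og.Model("/path/to/model")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("<|begin▁of▁sentence|><|User|>Tell me a story.<|Assistant|>"))

start = time.perf_counter()
generated = 0
while not generator.is_done():
    generator.generate_next_token()
    generated += 1
elapsed = time.perf_counter() - start
print(f"{generated / elapsed:.2f} tokens/s")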

CPU build specs:

  • onnxruntime-genai==0.6.0-dev
  • transformers==4.46.2
  • onnxruntime==1.20.1

CUDA build specs:

  • onnxruntime-genai-cuda==0.6.0-dev
  • transformers==4.46.2
  • onnxruntime-gpu==1.20.1

Model Description

  • Developed by: ONNX Runtime
  • Model type: ONNX
  • Language(s) (NLP): Python, C, C++
  • License: MIT
  • Model Description: These are conversions of the DeepSeek-R1-Distill-Qwen models for ONNX Runtime inference.
  • Disclaimer: This model is only an optimization of the base model; any risk associated with the model is the responsibility of the user of the model. Please verify and test for your scenarios. There may be a slight difference in output from the base model with the optimizations applied.

Base Model Information

See the Hugging Face model cards for DeepSeek-R1-Distill-Qwen-1.5B and DeepSeek-R1-Distill-Qwen-7B for details.
