Note: This is an unofficial version, intended for testing and development purposes only.
This guide demonstrates how to convert the Qwen3-0.6B model to ONNX format with Microsoft Olive and run inference with Microsoft ONNXRuntime GenAI.
Ensure you have the following tools installed:
# Update your transformers library
pip install transformers -U
# Install Microsoft Olive
pip install git+https://github.com/microsoft/Olive.git
# Install Microsoft ONNXRuntime GenAI
git clone https://github.com/microsoft/onnxruntime-genai
cd onnxruntime-genai && python build.py --config Release
Note: Building onnxruntime-genai from source requires CMake; if you don't have it installed, install it before proceeding.
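If you only need the Python bindings, installing the prebuilt package from PyPI may be enough instead of building from source (availability of a prebuilt wheel for your platform is an assumption; fall back to the source build above if needed):
# Install the prebuilt Python package (alternative to building from source)
pip install onnxruntime-genai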
# Convert the model using Microsoft Olive
olive auto-opt \
--model_name_or_path {Qwen3-0.6B_PATH} \
--device cpu \
--provider CPUExecutionProvider \
--use_model_builder \
--precision int4 \
--output_path {Your_Qwen3-0.6B_ONNX_Output_Path} \
--log_level 1
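Once the conversion completes, the output directory should contain the quantized ONNX weights together with the GenAI configuration and tokenizer files. A quick sanity check in Python (the exact file names and folder layout depend on the Olive version, so treat the expected names as an assumption):
import os

out_dir = "Your_Qwen3-0.6B_ONNX_Output_Path"  # the same path passed to --output_path
for name in sorted(os.listdir(out_dir)):
    print(name)  # expect files such as model.onnx, genai_config.json and tokenizer files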
Qwen3 supports two inference modes with different parameter configurations:
When you want the model to show its reasoning process:
<|im_start|>user\n/think {input}<|im_end|><|im_start|>assistant\n
For direct responses without showing reasoning:
<|im_start|>user\n/no_think {input}<|im_end|><|im_start|>assistant\n
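For convenience, the two templates can be wrapped in a small helper that toggles the soft switch; build_prompt is a hypothetical name used here for illustration, not part of any API:
def build_prompt(user_input: str, thinking: bool = True) -> str:
    """Format a single-turn Qwen3 prompt, selecting the /think or /no_think soft switch."""
    mode = "/think" if thinking else "/no_think"
    return f"<|im_start|>user\n{mode} {user_input}<|im_end|><|im_start|>assistant\n"

# Usage
print(build_prompt("What is the derivative of x^2?"))                # thinking mode
print(build_prompt("Can you introduce yourself?", thinking=False))   # non-thinking mode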
import onnxruntime_genai as og
# Set your model path
model_folder = "Your_Qwen3-0.6B_ONNX_Path"
# Initialize model and tokenizer
model = og.Model(model_folder)
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()
# Configuration for thinking mode
search_options = {
    'temperature': 0.6,
    'top_p': 0.95,
    'top_k': 20,
    'max_length': 32768,
    'repetition_penalty': 1
}
chat_template = "<|im_start|>user\n/think {input}<|im_end|><|im_start|>assistant\n"
text = 'What is the derivative of x^2?'
# Alternative configuration for non-thinking mode
# search_options = {
#     'temperature': 0.7,
#     'top_p': 0.8,
#     'top_k': 20,
#     'max_length': 4096,
#     'repetition_penalty': 1
# }
# chat_template = "<|im_start|>user\n/no_think {input}<|im_end|><|im_start|>assistant\n"
# text = 'Can you introduce yourself?'
# Prepare the prompt and generate response
prompt = chat_template.format(input=text)
input_tokens = tokenizer.encode(prompt)
params = og.GeneratorParams(model)
params.set_search_options(**search_options)
generator = og.Generator(model, params)
generator.append_tokens(input_tokens)
# Generate and stream the response
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end='', flush=True)
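In thinking mode, the generated text starts with the model's reasoning wrapped in <think>...</think> tags, followed by the final answer. If you accumulate the decoded tokens into a string instead of printing them, the two parts can be separated with a small helper (split_thinking is a hypothetical name; the tag-based format is Qwen3's documented behavior, but verify it against your model's actual output):
def split_thinking(generated_text: str):
    """Split a thinking-mode completion into (reasoning, answer) using the <think>...</think> tags."""
    start_tag, end_tag = "<think>", "</think>"
    if start_tag in generated_text and end_tag in generated_text:
        reasoning = generated_text.split(start_tag, 1)[1].split(end_tag, 1)[0].strip()
        answer = generated_text.split(end_tag, 1)[1].strip()
        return reasoning, answer
    return "", generated_text.strip()  # non-thinking mode: no reasoning block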