Qwen3-0.6B-ONNX-INT4-CPU

Note: This is an unofficial version, intended for testing and development purposes only.

Model Conversion

This guide demonstrates how to convert the Qwen3-0.6B model to ONNX format using Microsoft Olive and Microsoft ONNXRuntime GenAI.

Prerequisites

Ensure you have the following tools installed:

  1. CMake 3.31+
  2. Transformers 4.51+
  3. Microsoft Olive (development version)
  4. Microsoft ONNXRuntime GenAI (development version)

Installation Steps

# Update your transformers library
pip install transformers -U

# Install Microsoft Olive
pip install git+https://github.com/microsoft/Olive.git

# Install Microsoft ONNXRuntime GenAI
git clone https://github.com/microsoft/onnxruntime-genai
cd onnxruntime-genai && python build.py --config Release

Note: CMake is required to build ONNXRuntime GenAI from source; if it is not installed, install it before running the build step above.
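
To confirm the environment is ready, a quick import check such as the following can be used (a minimal sketch; the distribution names olive-ai and onnxruntime-genai are the assumed package names and may differ for a local source build):

# Sanity check that the required packages are installed and importable.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("transformers", "olive-ai", "onnxruntime-genai"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")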

Conversion Command

# Convert the model using Microsoft Olive
olive auto-opt \
    --model_name_or_path {Qwen3-0.6B_PATH} \
    --device cpu \
    --provider CPUExecutionProvider \
    --use_model_builder \
    --precision int4 \
    --output_path {Your_Qwen3-0.6B_ONNX_Output_Path} \
    --log_level 1
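
After the command finishes, the output directory should contain the INT4 ONNX model together with the GenAI runtime configuration and tokenizer files. A quick listing like the one below can confirm the export (a minimal sketch; file names such as genai_config.json reflect typical ONNX Runtime GenAI model-builder output, and depending on the Olive version the files may sit directly in the output path or in a model/ subfolder):

# List the exported artifacts (illustrative; adjust the path to your output folder).
from pathlib import Path

output_dir = Path("Your_Qwen3-0.6B_ONNX_Output_Path") / "model"
for f in sorted(output_dir.glob("*")):
    print(f.name)  # expect e.g. model.onnx, model.onnx.data, genai_config.json, tokenizer.json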

Inference

Qwen3 supports two inference modes with different parameter configurations:

Thinking Mode

When you want the model to show its reasoning process:

  • Parameters: Temperature=0.6, TopP=0.95, TopK=20, MinP=0.0
  • Chat Template: <|im_start|>user\n/think {input}<|im_end|><|im_start|>assistant\n

Non-Thinking Mode

For direct responses without showing reasoning:

  • Parameters: Temperature=0.7, TopP=0.8, TopK=20, MinP=0.0
  • Chat Template: <|im_start|>user\n/no_think {input}<|im_end|><|im_start|>assistant\n
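
The two configurations can be wrapped in a small helper so the rest of the inference code stays the same (a minimal sketch; the helper itself is illustrative and not part of the ONNXRuntime GenAI API, and MinP is omitted because the example below does not pass it as a search option):

# Return (search_options, chat_template) for Qwen3 thinking / non-thinking mode.
def qwen3_mode(thinking: bool):
    if thinking:
        options = {"temperature": 0.6, "top_p": 0.95, "top_k": 20,
                   "max_length": 32768, "repetition_penalty": 1.0}
        template = "<|im_start|>user\n/think {input}<|im_end|><|im_start|>assistant\n"
    else:
        options = {"temperature": 0.7, "top_p": 0.8, "top_k": 20,
                   "max_length": 4096, "repetition_penalty": 1.0}
        template = "<|im_start|>user\n/no_think {input}<|im_end|><|im_start|>assistant\n"
    return options, template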

Python Example

import onnxruntime_genai as og

# Set your model path
model_folder = "Your_Qwen3-0.6B_ONNX_Path"

# Initialize model and tokenizer
model = og.Model(model_folder)
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Configuration for thinking mode
search_options = {
    'temperature': 0.6,
    'top_p': 0.95,
    'top_k': 20,
    'max_length': 32768,
    'repetition_penalty': 1
}
chat_template = "<|im_start|>user\n/think {input}<|im_end|><|im_start|>assistant\n"
text = 'What is the derivative of x^2?'

# Alternative configuration for non-thinking mode
# search_options = {
#     'temperature': 0.7,
#     'top_p': 0.8,
#     'top_k': 20,
#     'max_length': 4096,
#     'repetition_penalty': 1
# }
# chat_template = "<|im_start|>user\n/no_think {input}<|im_end|><|im_start|>assistant\n"
# text = 'Can you introduce yourself?'

# Prepare the prompt and generate response
prompt = chat_template.format(input=text)
input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(**search_options)
generator = og.Generator(model, params)

generator.append_tokens(input_tokens)

# Generate and stream the response
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end='', flush=True)
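
In thinking mode the model wraps its reasoning in <think>...</think> tags before the final answer. If only the answer is wanted, the reasoning block can be stripped after generation (a minimal sketch, assuming the streamed output has been collected into a single string named response):

# Separate the reasoning from the final answer in a thinking-mode response.
import re

def split_thinking(response: str):
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    return reasoning, answer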