Update README.md
---
license: mit
---

## Qwen3-14B-ONNX-INT4-CPU

**Note:** This is an unofficial version, intended for testing and development purposes only.

### **Model Conversion**

This guide demonstrates how to convert the Qwen3-14B model to ONNX format using Microsoft Olive and Microsoft ONNXRuntime GenAI.

#### Prerequisites

Ensure you have the following tools installed:

1. CMake 3.31+
2. Transformers 4.51+
3. Microsoft Olive (development version)
4. Microsoft ONNXRuntime GenAI (development version)
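You can quickly verify that the version requirements above are met (a simple check, assuming a pip-based Python environment):

```bash
# Confirm tool versions meet the minimums listed above
cmake --version
python -c "import transformers; print(transformers.__version__)"
```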
#### Installation Steps

```bash
# Update your transformers library
pip install transformers -U

# Install Microsoft Olive
pip install git+https://github.com/microsoft/Olive.git

# Install Microsoft ONNXRuntime GenAI
git clone https://github.com/microsoft/onnxruntime-genai
cd onnxruntime-genai && python build.py --config Release
```

**Note:** If you don't have CMake installed, please install it first before proceeding.
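**Note:** `build.py` compiles the library and builds a Python wheel, but it may not install that wheel into your environment. A sketch of the follow-up install (the wheel's location varies by platform, configuration, and version):

```bash
# Install the wheel produced by the build (exact path varies by
# platform, configuration, and version; adjust to your build output)
pip install build/*/Release/wheel/onnxruntime_genai-*.whl
```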
#### Conversion Command

```bash
# Convert the model using Microsoft Olive
olive auto-opt \
    --model_name_or_path {Qwen3-14B_PATH} \
    --device cpu \
    --provider CPUExecutionProvider \
    --use_model_builder \
    --precision int4 \
    --output_path {Your_Qwen3-14B_ONNX_Output_Path} \
    --log_level 1
```
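Once the conversion finishes, the output path should contain the INT4 ONNX model together with the GenAI runtime configuration and tokenizer files. A quick way to confirm (file names are typical for the ONNXRuntime GenAI model builder and may vary by version; depending on the Olive version, the files may sit in a `model` subfolder):

```bash
ls {Your_Qwen3-14B_ONNX_Output_Path}
# Typical contents (may vary by version):
#   genai_config.json  model.onnx  model.onnx.data
#   tokenizer.json     tokenizer_config.json
```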

### **Inference**

Qwen3 supports two inference modes with different parameter configurations:

#### Thinking Mode
When you want the model to show its reasoning process:
- **Parameters:** Temperature=0.6, TopP=0.95, TopK=20, MinP=0.0
- **Chat Template:** `<|im_start|>user\n/think {input}<|im_end|><|im_start|>assistant\n`

#### Non-Thinking Mode
For direct responses without showing reasoning:
- **Parameters:** Temperature=0.7, TopP=0.8, TopK=20, MinP=0.0
- **Chat Template:** `<|im_start|>user\n/no_think {input}<|im_end|><|im_start|>assistant\n`
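The two modes differ only in the template's soft switch (`/think` vs. `/no_think`) and the sampling values, so the choice can be wrapped in a small helper (a sketch using the values above; `mode_config` is a hypothetical name, not part of any API):

```python
# Hypothetical helper: pick the chat template and sampling options per mode,
# using the parameter values listed above (MinP=0.0 is the default and omitted)
def mode_config(thinking: bool):
    if thinking:
        return ("<|im_start|>user\n/think {input}<|im_end|><|im_start|>assistant\n",
                {'temperature': 0.6, 'top_p': 0.95, 'top_k': 20})
    return ("<|im_start|>user\n/no_think {input}<|im_end|><|im_start|>assistant\n",
            {'temperature': 0.7, 'top_p': 0.8, 'top_k': 20})
```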
#### Python Example

```python
import onnxruntime_genai as og

# Set your model path
model_folder = "Your_Qwen3-14B_ONNX_Path"

# Initialize model and tokenizer
model = og.Model(model_folder)
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Configuration for thinking mode
search_options = {
    'temperature': 0.6,
    'top_p': 0.95,
    'top_k': 20,
    'max_length': 32768,
    'repetition_penalty': 1.0
}
chat_template = "<|im_start|>user\n/think {input}<|im_end|><|im_start|>assistant\n"
text = 'What is the derivative of x^2?'

# Alternative configuration for non-thinking mode
# search_options = {
#     'temperature': 0.7,
#     'top_p': 0.8,
#     'top_k': 20,
#     'max_length': 4096,
#     'repetition_penalty': 1.0
# }
# chat_template = "<|im_start|>user\n/no_think {input}<|im_end|><|im_start|>assistant\n"
# text = 'Can you introduce yourself?'

# Prepare the prompt and tokenize it
prompt = chat_template.format(input=text)
input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(**search_options)
generator = og.Generator(model, params)

generator.append_tokens(input_tokens)

# Generate and stream the response token by token
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end='', flush=True)
```
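In thinking mode, the model emits its reasoning wrapped in `<think>...</think>` tags before the final answer. If you collect the streamed tokens into a single string, the two parts can be separated with a small helper (a sketch; `split_thinking` is a hypothetical function, not part of the ONNXRuntime GenAI API):

```python
import re

def split_thinking(response: str):
    """Split a thinking-mode response into (reasoning, answer)."""
    match = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    if match:
        # Reasoning sits inside the tags; the final answer follows them
        return match.group(1).strip(), response[match.end():].strip()
    return "", response.strip()
```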