lokinfey committed · Commit 2ef2e77 · verified · 1 Parent(s): 02145c5

Update README.md
Files changed (1): README.md (+118 -3)

README.md CHANGED
---
license: mit
---

## Qwen3-14B-ONNX-INT4-CPU

**Note:** This is an unofficial version, intended for testing and development purposes only.

### **Model Conversion**

This guide demonstrates how to convert the Qwen3-14B model to ONNX format using Microsoft Olive and Microsoft ONNXRuntime GenAI.

#### Prerequisites

Ensure you have the following tools installed:

1. CMake 3.31+
2. Transformers 4.51+
3. Microsoft Olive (development version)
4. Microsoft ONNXRuntime GenAI (development version)

#### Installation Steps

```bash
# Update your transformers library
pip install transformers -U

# Install Microsoft Olive
pip install git+https://github.com/microsoft/Olive.git

# Install Microsoft ONNXRuntime GenAI
git clone https://github.com/microsoft/onnxruntime-genai
cd onnxruntime-genai && python build.py --config Release
```

**Note:** If you don't have CMake installed, please install it first before proceeding.
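
Optionally, you can confirm the environment from Python before converting. This is just a sanity-check sketch, assuming both packages expose a `__version__` attribute; it is not part of the conversion flow itself.

```python
# Quick environment check before running the conversion
import transformers
import onnxruntime_genai as og

# Transformers should be 4.51 or newer for Qwen3 support
print("transformers:", transformers.__version__)

# onnxruntime-genai should import cleanly after the source build
print("onnxruntime-genai:", og.__version__)
```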

#### Conversion Command

```bash
# Convert the model to INT4 ONNX for CPU using Microsoft Olive
olive auto-opt \
    --model_name_or_path {Qwen3-14B_PATH} \
    --device cpu \
    --provider CPUExecutionProvider \
    --use_model_builder \
    --precision int4 \
    --output_path {Your_Qwen3-14B_ONNX_Output_Path} \
    --log_level 1
```
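
After the conversion finishes, a quick way to confirm the exported folder is usable is to load it with ONNXRuntime GenAI. This is a minimal sketch; replace the placeholder path with the `--output_path` you used above.

```python
# Minimal load test for the converted model folder
import onnxruntime_genai as og

model_folder = "Your_Qwen3-14B_ONNX_Output_Path"  # same path passed to --output_path

# If these calls succeed, the exported model and tokenizer files are readable
model = og.Model(model_folder)
tokenizer = og.Tokenizer(model)
print("Model and tokenizer loaded successfully.")
```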

### **Inference**

Qwen3 supports two inference modes with different parameter configurations:

#### Thinking Mode
When you want the model to show its reasoning process:
- **Parameters:** Temperature=0.6, TopP=0.95, TopK=20, MinP=0.0
- **Chat Template:** `<|im_start|>user\n/think {input}<|im_end|><|im_start|>assistant\n`

#### Non-Thinking Mode
For direct responses without showing reasoning:
- **Parameters:** Temperature=0.7, TopP=0.8, TopK=20, MinP=0.0
- **Chat Template:** `<|im_start|>user\n/no_think {input}<|im_end|><|im_start|>assistant\n`
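
The two modes differ only in the `/think` vs `/no_think` prefix and in the sampling parameters. The helper below is a hypothetical convenience wrapper (not part of the model files) that returns the matching prompt and search options for either mode; the full streaming example follows.

```python
# Hypothetical helper: build the prompt and sampling options for a given mode
def build_prompt_and_options(user_text: str, thinking: bool):
    if thinking:
        template = "<|im_start|>user\n/think {input}<|im_end|><|im_start|>assistant\n"
        options = {'temperature': 0.6, 'top_p': 0.95, 'top_k': 20, 'max_length': 32768}
    else:
        template = "<|im_start|>user\n/no_think {input}<|im_end|><|im_start|>assistant\n"
        options = {'temperature': 0.7, 'top_p': 0.8, 'top_k': 20, 'max_length': 4096}
    return template.format(input=user_text), options

# Example: prompt, search_options = build_prompt_and_options("What is the derivative of x^2?", thinking=True)
```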

#### Python Example

```python
import onnxruntime_genai as og

# Set your model path
model_folder = "Your_Qwen3-14B_ONNX_Path"

# Initialize model and tokenizer
model = og.Model(model_folder)
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Configuration for thinking mode
search_options = {
    'temperature': 0.6,
    'top_p': 0.95,
    'top_k': 20,
    'max_length': 32768,
    'repetition_penalty': 1
}
chat_template = "<|im_start|>user\n/think {input}<|im_end|><|im_start|>assistant\n"
text = 'What is the derivative of x^2?'

# Alternative configuration for non-thinking mode
# search_options = {
#     'temperature': 0.7,
#     'top_p': 0.8,
#     'top_k': 20,
#     'max_length': 4096,
#     'repetition_penalty': 1
# }
# chat_template = "<|im_start|>user\n/no_think {input}<|im_end|><|im_start|>assistant\n"
# text = 'Can you introduce yourself?'

# Prepare the prompt and tokenize it
prompt = chat_template.format(input=text)
input_tokens = tokenizer.encode(prompt)

# Create a generator with the chosen sampling options
params = og.GeneratorParams(model)
params.set_search_options(**search_options)
generator = og.Generator(model, params)

generator.append_tokens(input_tokens)

# Generate and stream the response token by token
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end='', flush=True)
```
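
In thinking mode the streamed output typically begins with the model's reasoning (wrapped in `<think>...</think>` tags) before the final answer, so expect noticeably longer responses than in non-thinking mode. If you want a simple interactive session, one way (a sketch, not an official sample) is to wrap the generation step from the example above in a loop that rebuilds the prompt and generator for each user turn:

```python
# Simple single-turn interactive loop around the snippet above
# (reuses model, tokenizer, tokenizer_stream, chat_template, and search_options)
while True:
    text = input("\nUser: ")
    if not text or text.lower() in {"exit", "quit"}:
        break

    prompt = chat_template.format(input=text)
    input_tokens = tokenizer.encode(prompt)

    params = og.GeneratorParams(model)
    params.set_search_options(**search_options)
    generator = og.Generator(model, params)
    generator.append_tokens(input_tokens)

    while not generator.is_done():
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        print(tokenizer_stream.decode(new_token), end='', flush=True)
    print()
```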