Files changed (1)
  1. README.md +26 -1
README.md CHANGED
@@ -39,14 +39,16 @@ For more details, including benchmark evaluation, hardware requirements, and inf

## Quickstart

- We advise you to use the latest version of `transformers`.
+ We advise you to use the latest version of `transformers` and SGLang.

With `transformers<4.51.0`, you will encounter the following error:
+
```
KeyError: 'qwen3_moe'
```

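To confirm the installed version meets this requirement before running the snippet below, a minimal check, as a sketch (it assumes the `packaging` helper, which ships as a dependency of `transformers`):

```python
# Minimal version check (sketch): Qwen3-MoE support requires transformers >= 4.51.0.
import transformers
from packaging import version  # packaging is already a transformers dependency

if version.parse(transformers.__version__) < version.parse("4.51.0"):
    raise RuntimeError(
        f"transformers {transformers.__version__} is too old for qwen3_moe; "
        "upgrade with: pip install -U 'transformers>=4.51.0'"
    )
```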
The following contains a code snippet illustrating how to use the model to generate content based on given inputs.
+
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

@@ -84,6 +86,29 @@ content = tokenizer.decode(output_ids, skip_special_tokens=True)
print("content:", content)
```

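The diff elides the middle of that snippet; for context, a minimal end-to-end version of the same pattern, as a sketch (the model path, prompt, and generation settings are illustrative, not the README's exact code):

```python
# Illustrative end-to-end sketch of the generation flow above; not the
# README's exact code. Assumes a Qwen3 chat checkpoint and enough GPU memory.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-Coder-480B-A35B"  # path taken from the serving commands below

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Write a quicksort function in Python."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=512)
# Drop the prompt tokens before decoding, mirroring the README's
# `output_ids` / `content` variables.
output_ids = output[0][inputs.shape[-1]:]
content = tokenizer.decode(output_ids, skip_special_tokens=True)
print("content:", content)
```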
+ To serve the Qwen3 model on 4 or 8 H100/H200 GPUs with SGLang:
+
+ For the BF16 model:
+
+ ```bash
+ python3 -m sglang.launch_server --model-path Qwen/Qwen3-Coder-480B-A35B --tp 8 --tool-call-parser qwen3
+ ```
+
+ For the FP8 model:
+
+ ```bash
+ python3 -m sglang.launch_server --model-path Qwen/Qwen3-Coder-480B-A35B-FP8 --tp 4 --tool-call-parser qwen3
+ ```
+
+ or, with expert parallelism across 8 GPUs:
+
+ ```bash
+ python3 -m sglang.launch_server --model-path Qwen/Qwen3-Coder-480B-A35B-FP8 --tp 8 --enable-ep-moe --tool-call-parser qwen3
+ ```
+
+ * **FP8 model**: With `--tp 8` alone, a loading failure is expected; switch to expert-parallel mode using `--enable-ep-moe`.
+ * **Tool call**: Add `--tool-call-parser qwen3` to enable tool-call parsing.
+
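Once a server from one of the commands above is running, it exposes an OpenAI-compatible API; a minimal sketch of querying it (the default port `30000` is assumed; adjust `base_url` and `model` to match your launch flags):

```python
# Sketch: query the SGLang server via its OpenAI-compatible endpoint.
# Assumes the default port 30000; change base_url/model to match your launch flags.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-480B-A35B",
    messages=[{"role": "user", "content": "Explain tail recursion briefly."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```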
**Note: If you encounter out-of-memory (OOM) issues, consider reducing the context length to a shorter value, such as `32,768`.**

For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers also support Qwen3.