Files changed (1)
  1. README.md +26 -1
README.md CHANGED
@@ -39,14 +39,16 @@ For more details, including benchmark evaluation, hardware requirements, and inf

## Quickstart

- We advise you to use the latest version of `transformers`.
+ We advise you to use the latest version of `transformers` and SGLang.

With `transformers<4.51.0`, you will encounter the following error:
+
```
KeyError: 'qwen3_moe'
```

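To confirm the installed version meets this requirement before running the snippet below, a minimal check, as a sketch (it assumes the `packaging` helper, which ships as a dependency of `transformers`):

```python
# Minimal version check (sketch): Qwen3-MoE support requires transformers >= 4.51.0.
import transformers
from packaging import version  # packaging is already a transformers dependency

if version.parse(transformers.__version__) < version.parse("4.51.0"):
    raise RuntimeError(
        f"transformers {transformers.__version__} is too old for qwen3_moe; "
        "upgrade with: pip install -U 'transformers>=4.51.0'"
    )
```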
The following contains a code snippet illustrating how to use the model to generate content based on given inputs.
+
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

@@ -84,6 +86,29 @@ content = tokenizer.decode(output_ids, skip_special_tokens=True)
print("content:", content)
```

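The diff elides the middle of that snippet; for context, a minimal end-to-end version of the same pattern, as a sketch (the model path, prompt, and generation settings are illustrative, not the README's exact code):

```python
# Illustrative end-to-end sketch of the generation flow above; not the
# README's exact code. Assumes a Qwen3 chat checkpoint and enough GPU memory.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-Coder-480B-A35B"  # path taken from the serving commands below

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Write a quicksort function in Python."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=512)
# Drop the prompt tokens before decoding, mirroring the README's
# `output_ids` / `content` variables.
output_ids = output[0][inputs.shape[-1]:]
content = tokenizer.decode(output_ids, skip_special_tokens=True)
print("content:", content)
```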
+ To serve the Qwen3 model on 4 or 8 H100/H200 GPUs with SGLang:
+
+ For the BF16 model:
+
+ ```bash
+ python3 -m sglang.launch_server --model-path Qwen/Qwen3-Coder-480B-A35B --tp 8 --tool-call-parser qwen3
+ ```
+
+ For the FP8 model:
+
+ ```bash
+ python3 -m sglang.launch_server --model-path Qwen/Qwen3-Coder-480B-A35B-FP8 --tp 4 --tool-call-parser qwen3
+ ```
+
+ or, with expert parallelism across 8 GPUs:
+
+ ```bash
+ python3 -m sglang.launch_server --model-path Qwen/Qwen3-Coder-480B-A35B-FP8 --tp 8 --enable-ep-moe --tool-call-parser qwen3
+ ```
+
+ * **FP8 model**: With `--tp 8` alone, a loading failure is expected; switch to expert-parallel mode using `--enable-ep-moe`.
+ * **Tool call**: Add `--tool-call-parser qwen3` to enable tool-call parsing.
+
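Once a server from one of the commands above is running, it exposes an OpenAI-compatible API; a minimal sketch of querying it (the default port `30000` is assumed; adjust `base_url` and `model` to match your launch flags):

```python
# Sketch: query the SGLang server via its OpenAI-compatible endpoint.
# Assumes the default port 30000; change base_url/model to match your launch flags.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-480B-A35B",
    messages=[{"role": "user", "content": "Explain tail recursion briefly."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```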
**Note: If you encounter out-of-memory (OOM) issues, consider reducing the context length to a shorter value, such as `32,768`.**

For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers also support Qwen3.