base_model:
- Qwen/Qwen3-235B-A22B
base_model_relation: quantized
---
# Qwen3-235B-A22B-GPTQ-Int8

Base Model: [Qwen/Qwen3-235B-A22B](https://www.modelscope.cn/models/Qwen/Qwen3-235B-A22B)

### 【Model Update Date】
```
2025-05-09
1. Fast commit
2. Confirmed support for launching with 8 GPUs using `tensor-parallel-size` + `expert-parallel`
3. Must be launched with `gptq_marlin`; Compute Capability 7.x GPUs are not supported, since vLLM has no native GPTQ MoE implementation
```

### 【Dependencies】

```
vllm==0.8.5
transformers==4.51.3
```
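Before launching, it can help to confirm that the pinned versions are what is actually installed; a minimal sketch (the helper name is illustrative):

```python
from importlib.metadata import PackageNotFoundError, version

def check_pins(pins):
    """Return {package: (wanted, found)} for every pin that does not match."""
    mismatches = {}
    for pkg, wanted in pins.items():
        try:
            found = version(pkg)
        except PackageNotFoundError:
            found = None  # package is not installed at all
        if found != wanted:
            mismatches[pkg] = (wanted, found)
    return mismatches

# Report anything that differs from the pinned versions above.
print(check_pins({"vllm": "0.8.5", "transformers": "4.51.3"}))
```

An empty dict means the environment matches the pins.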
<div style="border: 1px solid rgba(255, 165, 0, 0.3); margin: 16px 0;">

### 【💡Notes on New VLLM MoE Versions💡】

#### 1. V0 Inference Mode Is Required
Before launching vLLM, set the following environment variable:
```
export VLLM_USE_V1=0
```
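If you start vLLM from a Python script rather than the shell, the same switch can be applied in-process; a minimal sketch, assuming it runs before anything imports `vllm`:

```python
import os

# Set before vLLM starts so the V0 engine is selected.
os.environ["VLLM_USE_V1"] = "0"
print(os.environ["VLLM_USE_V1"])
```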

#### 2. A Small Bug in gptq_marlin.py Requires Patching
Replace the file in your installation with the attached version at:

```.../vllm/model_executor/layers/quantization/gptq_marlin.py```

Otherwise, you may encounter the following error:
```
raise NotImplementedError(
NotImplementedError: Apply router weight on input is not supported for fused Marlin MoE method.
```

</div>
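To locate the copy of `gptq_marlin.py` inside your own installation, the module path can be resolved programmatically; a small sketch (the helper name is illustrative):

```python
import importlib.util

def find_module_file(dotted_name):
    """Return the source path of an importable module, or None if absent."""
    try:
        spec = importlib.util.find_spec(dotted_name)
    except ModuleNotFoundError:
        return None  # a parent package is missing
    return spec.origin if spec else None

# Prints the path of the file to patch when vllm is installed, otherwise None.
print(find_module_file("vllm.model_executor.layers.quantization.gptq_marlin"))
```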
<div style="border: 1px solid rgba(255, 0, 200, 0.3); margin: 16px 0;">

### 【💡Notes on Qwen3-235B-A22B💡】

#### 1. Expert Parallelism Must Be Enabled
When launching vLLM, remember to enable expert parallelism (`--enable-expert-parallel`); otherwise, a multi-GPU launch on a single node (e.g., 8 GPUs) will fail.

Example launch command:
```commandline
vllm serve \
    QuantTrio/Qwen3-235B-A22B-GPTQ-Int8 \
    --served-model-name Qwen3-235B-A22B-GPTQ-Int8 \
    --max-num-seqs 8 \
    --max-model-len 32768 \
```

</div>
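Once the server is running, it exposes an OpenAI-compatible HTTP API (port 8000 by default). A minimal sketch of preparing a request against it; the helper below only builds the URL and payload, and sending it requires the live server:

```python
import json

def build_chat_request(model, prompt, base_url="http://localhost:8000/v1"):
    """Build the URL and JSON body for an OpenAI-style chat completion call."""
    return {
        "url": f"{base_url}/chat/completions",
        "body": json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }),
    }

req = build_chat_request("Qwen3-235B-A22B-GPTQ-Int8", "Hello")
print(req["url"])
```

The `model` field must match the name passed to `--served-model-name`.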

### 【Model List】

| FILE SIZE | LATEST UPDATE TIME |
|-----------|--------------------|
| `226GB`   | `2025-05-09`       |

### 【Model Download】

```python
from huggingface_hub import snapshot_download
snapshot_download('QuantTrio/Qwen3-235B-A22B-GPTQ-Int8', cache_dir="local_path")
```
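Once the snapshot is on disk, `vllm serve` can be pointed at the local directory instead of the hub ID. A sketch that assembles the flags from the example launch command (the helper name and local path are illustrative):

```python
def build_serve_command(model_path, served_name, max_num_seqs=8, max_model_len=32768):
    """Assemble a `vllm serve` argv list for a local model snapshot."""
    return [
        "vllm", "serve", model_path,
        "--served-model-name", served_name,
        "--max-num-seqs", str(max_num_seqs),
        "--max-model-len", str(max_model_len),
        "--enable-expert-parallel",  # required for single-node multi-GPU launch
    ]

# Path is illustrative; use the directory snapshot_download actually returned.
cmd = build_serve_command("local_path/Qwen3-235B-A22B-GPTQ-Int8",
                          "Qwen3-235B-A22B-GPTQ-Int8")
print(" ".join(cmd))
```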

### 【Introduction】

# Qwen3-235B-A22B
<a href="https://chat.qwen.ai/" target="_blank" style="margin: 2px;">