JunHowie committed
Commit 3f2f0c8 · verified · 1 Parent(s): afb6654

Update README.md

Files changed (1)
  1. README.md +23 -22
README.md CHANGED
@@ -12,18 +12,18 @@ base_model:
  - Qwen/Qwen3-235B-A22B
  base_model_relation: quantized
  ---
- # Tongyi Qianwen 3-235B-A22B-GPTQ-Int8
- Base model: [Qwen/Qwen3-235B-A22B](https://www.modelscope.cn/models/Qwen/Qwen3-235B-A22B)

- ### 【Model Update Date】
  ```
  2025-05-09
- 1. Initial commit
- 2. Confirmed support for launching with 8 GPUs using `tensor-parallel-size` + `expert-parallel`
- 3. Must be launched with gptq_marlin; compute capability 7 GPUs are not supported: vLLM does not implement a native GPTQ MoE module.
  ```

- ### 【Dependencies】

  ```
  vllm==0.8.5
@@ -37,20 +37,20 @@ transformers==4.51.3
  border: 1px solid rgba(255, 165, 0, 0.3);
  margin: 16px 0;
  ">
- ### 【💡Notes on New vLLM MoE Versions💡】

- #### 1. V0 inference mode is required
- Before launching vLLM, set the environment variable
  ```
  export VLLM_USE_V1=0
  ```

- #### 2. `gptq_marlin.py` has a small bug and needs a patch
- Replace the file at the following path with the attached file

  ```.../vllm/model_executor/layers/quantization/gptq_marlin.py```

- Otherwise the following error occurs
  ```
  raise NotImplementedError(
  NotImplementedError: Apply router weight on input is not supported forfused Marlin MoE method.
@@ -64,13 +64,13 @@ NotImplementedError: Apply router weight on input is not supported forfused Marl
  border: 1px solid rgba(255, 0, 200, 0.3);
  margin: 16px 0;
  ">
- ### 【💡Notes on Tongyi Qianwen 3-235B-A22B💡】

- #### 1. When launching vLLM, remember to use expert-parallel mode (`--enable-expert-parallel`); otherwise the model cannot be launched on a single node with 8 GPUs.
- Launch example:
  ```commandline
  vllm serve \
- tclf90/Qwen3-235B-A22B-GPTQ-Int8 \
  --served-model-name Qwen3-235B-A22B-GPTQ-Int8 \
  --max-num-seqs 8 \
  --max-model-len 32768 \
@@ -84,22 +84,23 @@ vllm serve \
  </div>


- ### 【Model List】

- | File Size | Last Updated |
  |---------|--------------|
  | `226GB` | `2025-05-09` |



- ### 【Model Download】

  ```python
- from modelscope import snapshot_download
- snapshot_download('tclf90/Qwen3-235B-A22B-GPTQ-Int8', cache_dir="local_path")
  ```


  ### 【Introduction】
  # Qwen3-235B-A22B
  <a href="https://chat.qwen.ai/" target="_blank" style="margin: 2px;">
 
  - Qwen/Qwen3-235B-A22B
  base_model_relation: quantized
  ---
+ # Qwen3-235B-A22B-GPTQ-Int8
+ Base model: [Qwen/Qwen3-235B-A22B](https://www.modelscope.cn/models/Qwen/Qwen3-235B-A22B)

+ ### 【Model Update Date】
  ```
  2025-05-09
+ 1. Initial commit
+ 2. Confirmed support for launching with 8 GPUs using `tensor-parallel-size` + `expert-parallel`
+ 3. Must be launched with `gptq_marlin`; compute capability 7 GPUs are not supported: vLLM does not implement a native GPTQ MoE module
  ```

+ ### 【Dependencies】

  ```
  vllm==0.8.5
 
  border: 1px solid rgba(255, 165, 0, 0.3);
  margin: 16px 0;
  ">
+ ### 【💡Notes on New vLLM MoE Versions💡】

+ #### 1. V0 Inference Mode Is Required
+ Before launching vLLM, set the following environment variable:
  ```
  export VLLM_USE_V1=0
  ```
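If vLLM is driven from Python rather than from a shell, the same V0 fallback can be selected in code. The snippet below is only an illustrative sketch (it is not part of this README) and assumes the variable is exported before vLLM is first imported.

```python
# Illustrative sketch (not part of the README): select the V0 engine when
# driving vLLM from Python. Assumption: the variable must be exported before
# any `import vllm` statement runs.
import os

os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM  # safe to import now that the flag is in place
```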

+ #### 2. A Small Bug in `gptq_marlin.py` Requires a Patch
+ Replace the file in your installation with the attached, patched version at:

  ```.../vllm/model_executor/layers/quantization/gptq_marlin.py```
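The leading `...` stands for the site-packages directory of your environment. A quick, hedged way to resolve the exact file location (assuming vLLM was installed with pip into the current environment) is:

```python
# Sketch: resolve the absolute path of the gptq_marlin.py file to be replaced,
# assuming a standard pip install of vllm in the active environment.
import os
import vllm

print(os.path.join(
    os.path.dirname(vllm.__file__),
    "model_executor", "layers", "quantization", "gptq_marlin.py",
))
```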

+ Otherwise, you may encounter the following error:
  ```
  raise NotImplementedError(
  NotImplementedError: Apply router weight on input is not supported forfused Marlin MoE method.
 
  border: 1px solid rgba(255, 0, 200, 0.3);
  margin: 16px 0;
  ">
+ ### 【💡Notes on Qwen3-235B-A22B💡】

+ #### 1. When launching vLLM, remember to enable expert parallelism (`--enable-expert-parallel`); otherwise the model cannot be launched on a single node with 8 GPUs.
+ Example launch command:
  ```commandline
  vllm serve \
+ QuantTrio/Qwen3-235B-A22B-GPTQ-Int8 \
  --served-model-name Qwen3-235B-A22B-GPTQ-Int8 \
  --max-num-seqs 8 \
  --max-model-len 32768 \
 
  </div>


+ ### 【Model List】

+ | File Size | Last Updated |
  |---------|--------------|
  | `226GB` | `2025-05-09` |



+ ### 【Model Download】

  ```python
+ from huggingface_hub import snapshot_download
+ snapshot_download('QuantTrio/Qwen3-235B-A22B-GPTQ-Int8', cache_dir="local_path")
  ```
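Here `cache_dir="local_path"` is a placeholder. Note also that `snapshot_download` returns the resolved local snapshot directory, which can be passed to `vllm serve` in place of the repo id; the variable name below is purely illustrative.

```python
# Sketch: capture the resolved snapshot directory so it can later be passed to
# `vllm serve` (or another loader) instead of the repo id.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    "QuantTrio/Qwen3-235B-A22B-GPTQ-Int8",
    cache_dir="local_path",  # placeholder; point this at a disk with ~226 GB free
)
print(local_dir)
```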


+
  ### 【Introduction】
  # Qwen3-235B-A22B
  <a href="https://chat.qwen.ai/" target="_blank" style="margin: 2px;">