JunHowie committed
Commit 3f2f0c8 · verified · 1 Parent(s): afb6654

Update README.md

Files changed (1)
  1. README.md +23 -22
README.md CHANGED
@@ -12,18 +12,18 @@ base_model:
  - Qwen/Qwen3-235B-A22B
  base_model_relation: quantized
  ---
- # Tongyi Qianwen 3-235B-A22B-GPTQ-Int8
- Base model: [Qwen/Qwen3-235B-A22B](https://www.modelscope.cn/models/Qwen/Qwen3-235B-A22B)

- ### 【Model Update Date】
  ```
  2025-05-09
- 1. Initial commit
- 2. Confirmed support for launching with 8 GPUs using `tensor-parallel-size` + `expert-parallel`
- 3. Must be launched with gptq_marlin; compute capability 7 GPUs are not supported: vLLM does not implement a native GPTQ MoE module.
  ```

- ### 【Dependencies】

  ```
  vllm==0.8.5
@@ -37,20 +37,20 @@ transformers==4.51.3
  border: 1px solid rgba(255, 165, 0, 0.3);
  margin: 16px 0;
  ">
- ### 【💡Notes on New vLLM MoE Versions💡】

- #### 1. V0 inference mode is required
- Before launching vLLM, set the environment variable
  ```
  export VLLM_USE_V1=0
  ```

- #### 2. `gptq_marlin.py` has a small bug and needs a patch
- Replace the file at the following path with the attached file

  ```.../vllm/model_executor/layers/quantization/gptq_marlin.py```

- Otherwise the following error occurs
  ```
  raise NotImplementedError(
  NotImplementedError: Apply router weight on input is not supported forfused Marlin MoE method.
@@ -64,13 +64,13 @@ NotImplementedError: Apply router weight on input is not supported forfused Marl
  border: 1px solid rgba(255, 0, 200, 0.3);
  margin: 16px 0;
  ">
- ### 【💡Notes on Tongyi Qianwen 3-235B-A22B💡】

- #### 1. When launching vLLM, remember to use expert-parallel mode (`--enable-expert-parallel`); otherwise the model cannot be launched on a single node with 8 GPUs.
- Launch example:
  ```commandline
  vllm serve \
- tclf90/Qwen3-235B-A22B-GPTQ-Int8 \
  --served-model-name Qwen3-235B-A22B-GPTQ-Int8 \
  --max-num-seqs 8 \
  --max-model-len 32768 \
@@ -84,22 +84,23 @@ vllm serve \
  </div>


- ### 【Model List】

- | File Size | Last Updated |
  |---------|--------------|
  | `226GB` | `2025-05-09` |



- ### 【Model Download】

  ```python
- from modelscope import snapshot_download
- snapshot_download('tclf90/Qwen3-235B-A22B-GPTQ-Int8', cache_dir="local_path")
  ```


  ### 【Introduction】
  # Qwen3-235B-A22B
  <a href="https://chat.qwen.ai/" target="_blank" style="margin: 2px;">
 
  - Qwen/Qwen3-235B-A22B
  base_model_relation: quantized
  ---
+ # Qwen3-235B-A22B-GPTQ-Int8
+ Base model: [Qwen/Qwen3-235B-A22B](https://www.modelscope.cn/models/Qwen/Qwen3-235B-A22B)

+ ### 【Model Update Date】
  ```
  2025-05-09
+ 1. Initial commit
+ 2. Confirmed support for launching with 8 GPUs using `tensor-parallel-size` + `expert-parallel`
+ 3. Must be launched with `gptq_marlin`; compute capability 7 GPUs are not supported: vLLM does not implement a native GPTQ MoE module
  ```

+ ### 【Dependencies】

  ```
  vllm==0.8.5
 
  border: 1px solid rgba(255, 165, 0, 0.3);
  margin: 16px 0;
  ">
+ ### 【💡Notes on New vLLM MoE Versions💡】

+ #### 1. V0 Inference Mode Is Required
+ Before launching vLLM, set the following environment variable:
  ```
  export VLLM_USE_V1=0
  ```
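If vLLM is driven from Python rather than from a shell, the same V0 fallback can be selected in code. The snippet below is only an illustrative sketch (it is not part of this README) and assumes the variable is exported before vLLM is first imported.

```python
# Illustrative sketch (not part of the README): select the V0 engine when
# driving vLLM from Python. Assumption: the variable must be exported before
# any `import vllm` statement runs.
import os

os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM  # safe to import now that the flag is in place
```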

+ #### 2. A Small Bug in `gptq_marlin.py` Requires a Patch
+ Replace the file in your installation with the attached, patched version at:

  ```.../vllm/model_executor/layers/quantization/gptq_marlin.py```
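The leading `...` stands for the site-packages directory of your environment. A quick, hedged way to resolve the exact file location (assuming vLLM was installed with pip into the current environment) is:

```python
# Sketch: resolve the absolute path of the gptq_marlin.py file to be replaced,
# assuming a standard pip install of vllm in the active environment.
import os
import vllm

print(os.path.join(
    os.path.dirname(vllm.__file__),
    "model_executor", "layers", "quantization", "gptq_marlin.py",
))
```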

+ Otherwise, you may encounter the following error:
  ```
  raise NotImplementedError(
  NotImplementedError: Apply router weight on input is not supported forfused Marlin MoE method.
 
  border: 1px solid rgba(255, 0, 200, 0.3);
  margin: 16px 0;
  ">
+ ### 【💡Notes on Qwen3-235B-A22B💡】

+ #### 1. When launching vLLM, remember to enable expert parallelism (`--enable-expert-parallel`); otherwise the model cannot be launched on a single node with 8 GPUs.
+ Example launch command:
  ```commandline
  vllm serve \
+ QuantTrio/Qwen3-235B-A22B-GPTQ-Int8 \
  --served-model-name Qwen3-235B-A22B-GPTQ-Int8 \
  --max-num-seqs 8 \
  --max-model-len 32768 \
 
  </div>


+ ### 【Model List】

+ | File Size | Last Updated |
  |---------|--------------|
  | `226GB` | `2025-05-09` |



+ ### 【Model Download】

  ```python
+ from huggingface_hub import snapshot_download
+ snapshot_download('QuantTrio/Qwen3-235B-A22B-GPTQ-Int8', cache_dir="local_path")
  ```
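Here `cache_dir="local_path"` is a placeholder. Note also that `snapshot_download` returns the resolved local snapshot directory, which can be passed to `vllm serve` in place of the repo id; the variable name below is purely illustrative.

```python
# Sketch: capture the resolved snapshot directory so it can later be passed to
# `vllm serve` (or another loader) instead of the repo id.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    "QuantTrio/Qwen3-235B-A22B-GPTQ-Int8",
    cache_dir="local_path",  # placeholder; point this at a disk with ~226 GB free
)
print(local_dir)
```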


+
  ### 【Introduction】
  # Qwen3-235B-A22B
  <a href="https://chat.qwen.ai/" target="_blank" style="margin: 2px;">