|
--- |
|
license: apache-2.0 |
|
tasks: |
|
- text-generation |
|
base_model: |
|
- Qwen/Qwen3-30B-A3B |
|
--- |
|
|
|
## Intro |
|
|
|
This AWQ version is quantized with [ms-swift](https://github.com/modelscope/ms-swift). For our best practice on training and fine-tuning Qwen3 models, see [here](https://github.com/modelscope/ms-swift/issues/4030).
|
|
|
Note that the AWQ versions of the Qwen3-MoE models are verified to work with Transformers and vLLM; we have not yet had the chance to test them on other inference engines. A minimal vLLM sketch is included at the end of the Inference section below.
|
|
|
## Inference |
|
|
|
```python |
|
import torch
from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name = "swift/Qwen3-30B-A3B-AWQ"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # switches between thinking and non-thinking modes; default is True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parse the thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
|
``` |
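
Since the checkpoint is also verified on vLLM, below is a minimal offline-inference sketch. It is not taken from the original card: the engine options (`quantization="awq"`, `max_model_len`) and the sampling settings are illustrative assumptions you may need to adjust, and `VLLM_USE_MODELSCOPE` is only needed if you want vLLM to download the checkpoint from ModelScope rather than the Hugging Face Hub.

```python
import os
os.environ["VLLM_USE_MODELSCOPE"] = "True"  # optional: pull the checkpoint from ModelScope

from vllm import LLM, SamplingParams

# load the AWQ checkpoint; vLLM can usually auto-detect AWQ, passing it explicitly is a safeguard
llm = LLM(model="swift/Qwen3-30B-A3B-AWQ", quantization="awq", max_model_len=8192)
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)

messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```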
|
|
|
## Quantization |
|
|
|
The model has undergone AWQ int4 quantization using the [ms-swift](https://github.com/modelscope/ms-swift) framework. Since the model is based on the MoE (Mixture of Experts) architecture, all `linear` layers except for `gate` and `lm_head` have been quantized. |
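
To double-check which modules were excluded, one can inspect the checkpoint's `quantization_config`. The following is a small verification sketch (not from the original card), assuming the config follows the standard Transformers AWQ layout:

```python
from modelscope import AutoConfig

# load only the config of the quantized checkpoint and print its quantization settings
config = AutoConfig.from_pretrained("swift/Qwen3-30B-A3B-AWQ")
print(config.quantization_config)
# expect 4-bit AWQ settings, with the un-quantized modules (e.g. gate / lm_head)
# listed under a field such as `modules_to_not_convert`; field names may vary
```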
|
|
|
If you have fine-tuned the model and wish to quantize the fine-tuned version, you can refer to the following quantization scripts: |
|
|
|
- Dense Model Quantization Script: [View Here](https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize/awq.sh) |
|
- MoE Model Quantization Script: [View Here](https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize/moe/awq.sh) |
|
|
|
With these scripts, you can easily complete the quantization process for the model. |
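
For reference, a hedged Python-level sketch of the same flow is shown below. The entry point and argument names mirror the `swift export` flags used in the linked scripts and are assumptions here; treat the scripts above as the source of truth for your ms-swift version.

```python
from swift.llm import ExportArguments, export_main

# quantize a base or fine-tuned checkpoint to AWQ int4 (argument names are assumptions
# mirroring the CLI flags in the linked awq.sh scripts; adjust to your ms-swift version)
export_main(ExportArguments(
    model="Qwen/Qwen3-30B-A3B",                          # or the path to your fine-tuned checkpoint
    quant_method="awq",
    quant_bits=4,
    dataset=["AI-ModelScope/alpaca-gpt4-data-en#256"],   # calibration data (illustrative)
    output_dir="Qwen3-30B-A3B-AWQ",
))
```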
|
|
|
|
|
## Evaluation |
|
|
|
We evaluate the quality of this AWQ quantization with [EvalScope](https://github.com/modelscope/evalscope). For best practices on evaluating Qwen3 models, refer to the following (a minimal usage sketch is given after the links):

- [Best Practice (Chinese)](https://evalscope.readthedocs.io/zh-cn/latest/best_practice/qwen3.html)

- [Best Practice (English)](https://evalscope.readthedocs.io/en/latest/best_practice/qwen3.html)
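
As a quick illustration, here is a minimal EvalScope sketch. It is not the exact setup used for the table below; the dataset name and sample limit are illustrative assumptions.

```python
from evalscope import TaskConfig, run_task

# run a small sanity-check evaluation on the AWQ checkpoint
task_cfg = TaskConfig(
    model="swift/Qwen3-30B-A3B-AWQ",
    datasets=["gsm8k"],   # illustrative; see the best-practice links for the full Qwen3 setup
    limit=10,             # evaluate only a few samples for a quick check
)
run_task(task_cfg=task_cfg)
```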
|
|
|
Qwen3-30B-A3B-AWQ is evaluated on our mixed benchmark, the [Qwen3 Evaluation Collection](https://modelscope.cn/datasets/modelscope/EvalScope-Qwen3-Test); the results are listed below:
|
|
|
> The performance comparison of Qwen3-30B-A3B-AWQ and Qwen3-30B-A3B |
|
|
|
| task_type | dataset_name | metric | average_score(AWQ) | average_score(without AWQ) | count | |
|
|-------------|-----------------|-------------------------|--------------------|----------------------------|-------| |
|
| exam | MMLU-Pro | AverageAccuracy | 0.7655 | 0.7828 | 12032 | |
|
| exam | MMLU-Redux | AverageAccuracy | 0.8746 | 0.8872 | 5700 | |
|
| exam | C-Eval | AverageAccuracy | 0.844 | 0.8722 | 1346 | |
|
| instruction | IFEval | inst_level_strict_acc | 0.8891 | 0.8925 | 541 | |
|
| instruction | IFEval | inst_level_loose_acc | 0.9107 | 0.9174 | 541 | |
|
| instruction | IFEval | prompt_level_loose_acc | 0.8651 | 0.8651 | 541 | |
|
| instruction | IFEval | prompt_level_strict_acc | 0.8373 | 0.8318 | 541 | |
|
| math | MATH-500 | AveragePass@1 | 0.944 | 0.938 | 500 | |
|
| knowledge | GPQA | AveragePass@1 | 0.596 | 0.601 | 198 | |
|
| code | LiveCodeBench | Pass@1 | 0.5275 | 0.5549 | 182 | |
|
| exam | iQuiz | AverageAccuracy | 0.6917 | 0.7417 | 120 | |
|
| math | AIME 2024 | AveragePass@1 | 0.7333 | 0.8333 | 30 | |
|
| math | AIME 2025 | AveragePass@1 | 0.7 | 0.7333 | 30 | |
|
|
|
> NOTE: For pass@k metrics, to keep evaluation time manageable, we uniformly limit the number of generated responses to 1.
|
|
|
### Conclusion |
|
|
|
As the comparison above shows, evaluation results across different tasks and datasets suggest that the AWQ-quantized version exhibits minimal fluctuation in model performance.
|
|
|
In fact, on most benchmarks the AWQ version performs on par with the original model, except for a few (such as AIME 2024 and iQuiz) where the performance degradation is more noticeable.
|
|
|
|