---
license: apache-2.0
tasks:
- text-generation
base_model:
- Qwen/Qwen3-30B-A3B
---
## Intro
The AWQ version is quantized using [ms-swift](https://github.com/modelscope/ms-swift). You may refer to our best practices for training/fine-tuning Qwen3 models [here](https://github.com/modelscope/ms-swift/issues/4030).
Note that the AWQ versions of the Qwen3-MoE models are verified to work on Transformers and vLLM. We have not yet had the chance to test them on other engines.
## Inference
```python
import torch
from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name = "swift/Qwen3-30B-A3B-AWQ"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parse the thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```
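Since this AWQ checkpoint is also verified on vLLM, below is a minimal offline-inference sketch using vLLM's Python API. The sampling settings and `max_model_len` are illustrative assumptions rather than official recommendations:
```python
# Hedged vLLM offline-inference sketch (assumed settings; tune for your hardware).
# To download the checkpoint from ModelScope instead of the Hugging Face Hub,
# set the environment variable VLLM_USE_MODELSCOPE=1 before running.
from vllm import LLM, SamplingParams
from modelscope import AutoTokenizer

model_name = "swift/Qwen3-30B-A3B-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# vLLM picks up the AWQ quantization from the checkpoint's config
llm = LLM(model=model_name, max_model_len=32768)  # assumed context length; lower it if memory is tight
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=4096)

messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```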
## Quantization
The model has undergone AWQ int4 quantization using the [ms-swift](https://github.com/modelscope/ms-swift) framework. Since the model is based on the MoE (Mixture of Experts) architecture, all `linear` layers except for `gate` and `lm_head` have been quantized.
If you have fine-tuned the model and wish to quantize the fine-tuned version, you can refer to the following quantization scripts:
- Dense Model Quantization Script: [View Here](https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize/awq.sh)
- MoE Model Quantization Script: [View Here](https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize/moe/awq.sh)
With these scripts, you can easily complete the quantization process for the model.
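For reference, the layer-selection rule described above (quantize every `nn.Linear` except the MoE router `gate` and `lm_head`) can be sketched as the small helper below; this is an illustrative example, not the actual ms-swift implementation:
```python
# Illustrative sketch of the AWQ layer-selection rule for a Qwen3-MoE model:
# quantize all nn.Linear modules except the MoE router ("gate") and "lm_head".
import torch.nn as nn

def collect_quantizable_linears(model: nn.Module) -> list[str]:
    def skip(name: str) -> bool:
        # ".gate" matches the MoE router only, not the experts' "gate_proj" layers
        return name.endswith(".gate") or name.endswith("lm_head")
    return [
        name
        for name, module in model.named_modules()
        if isinstance(module, nn.Linear) and not skip(name)
    ]
```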
## Evaluation
We evaluate the quality of this AWQ quantization with [EvalScope](https://github.com/modelscope/evalscope). For best practices on evaluating Qwen3 models, refer to the guides below (a minimal invocation sketch follows the links):
- [Best Practice (Chinese)](https://evalscope.readthedocs.io/zh-cn/latest/best_practice/qwen3.html)
- [Best Practice (English)](https://evalscope.readthedocs.io/en/latest/best_practice/qwen3.html)
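As a rough illustration, a single-dataset EvalScope run can look like the sketch below; the dataset choice and the `limit` value are assumptions for a quick sanity check, so please follow the best-practice guides above for the full mixed-benchmark configuration:
```python
# Hedged EvalScope sketch: a small single-dataset run, not the exact setup
# used for the mixed-benchmark results reported in the table below.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model="swift/Qwen3-30B-A3B-AWQ",
    datasets=["gsm8k"],  # example dataset; the results below use a mixed collection
    limit=10,            # evaluate only a few samples as a quick sanity check
)
run_task(task_cfg=task_cfg)
```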
The performance of Qwen3-30B-A3B-AWQ is evaluated on our mixed benchmark, the [Qwen3 Evaluation Collection](https://modelscope.cn/datasets/modelscope/EvalScope-Qwen3-Test), with the results listed below:
> The performance comparison of Qwen3-30B-A3B-AWQ and Qwen3-30B-A3B
| task_type | dataset_name | metric | average_score(AWQ) | average_score(without AWQ) | count |
|-------------|-----------------|-------------------------|--------------------|----------------------------|-------|
| exam | MMLU-Pro | AverageAccuracy | 0.7655 | 0.7828 | 12032 |
| exam | MMLU-Redux | AverageAccuracy | 0.8746 | 0.8872 | 5700 |
| exam | C-Eval | AverageAccuracy | 0.844 | 0.8722 | 1346 |
| instruction | IFEval | inst_level_strict_acc | 0.8891 | 0.8925 | 541 |
| instruction | IFEval | inst_level_loose_acc | 0.9107 | 0.9174 | 541 |
| instruction | IFEval | prompt_level_loose_acc | 0.8651 | 0.8651 | 541 |
| instruction | IFEval | prompt_level_strict_acc | 0.8373 | 0.8318 | 541 |
| math | MATH-500 | AveragePass@1 | 0.944 | 0.938 | 500 |
| knowledge | GPQA | AveragePass@1 | 0.596 | 0.601 | 198 |
| code | LiveCodeBench | Pass@1 | 0.5275 | 0.5549 | 182 |
| exam | iQuiz | AverageAccuracy | 0.6917 | 0.7417 | 120 |
| math | AIME 2024 | AveragePass@1 | 0.7333 | 0.8333 | 30 |
| math | AIME 2025 | AveragePass@1 | 0.7 | 0.7333 | 30 |
> NOTE: For the Pass@k metrics, considering the time cost of evaluation, we uniformly limit the number of generated responses to 1 (i.e., Pass@1).
### Conclusion
As the comparison above shows, the evaluation results across different tasks and datasets suggest that the AWQ-quantized version exhibits minimal fluctuation in model performance.
In fact, on most benchmarks the AWQ version performs roughly on par with the original model, except for a few (such as AIME 2024 and iQuiz) where the performance degradation is relatively noticeable.