---
license: apache-2.0
tasks:
- text-generation
base_model:
  - Qwen/Qwen3-30B-A3B
---

## Intro

The AWQ version is quantized using [ms-swift](https://github.com/modelscope/ms-swift). You may refer to our best practices for training/fine-tuning Qwen3 models [here](https://github.com/modelscope/ms-swift/issues/4030).

Note that the AWQ versions of the Qwen3-MoE models are verified to work with Transformers and vLLM. We have not yet had the chance to test them on other inference engines.

## Inference

```python
import torch
from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name = "swift/Qwen3-30B-A3B-AWQ"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```
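
As mentioned in the intro, the AWQ checkpoint is also expected to work with vLLM. The following is a minimal, unverified sketch of offline inference; the vLLM version requirement, the sampling parameters, and `max_model_len` are assumptions you should adapt to your setup (set `VLLM_USE_MODELSCOPE=1` if the weights should be pulled from ModelScope rather than the Hugging Face Hub).

```python
# Minimal vLLM sketch (assumption: a vLLM release with Qwen3 support is installed).
from vllm import LLM, SamplingParams

model_name = "swift/Qwen3-30B-A3B-AWQ"

# vLLM reads the AWQ quantization settings from the checkpoint's config;
# max_model_len is an illustrative value, adjust it to your GPU memory.
llm = LLM(model=model_name, max_model_len=8192)

sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)

messages = [
    {"role": "user", "content": "Give me a short introduction to large language model."}
]

# llm.chat applies the model's chat template before generation
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```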

## Quantization

The model has undergone AWQ int4 quantization using the [ms-swift](https://github.com/modelscope/ms-swift) framework. Since the model is based on the MoE (Mixture of Experts) architecture, all `linear` layers except for `gate` and `lm_head` have been quantized.
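
As a quick sanity check of the statement above, you can inspect the quantization metadata stored with the checkpoint. The snippet below is only a sketch: the field names (`modules_to_not_convert`, `group_size`, ...) are typical for AWQ configs but are assumptions that may vary across toolchains.

```python
# Inspect the AWQ quantization metadata shipped with the checkpoint.
from modelscope import AutoConfig

config = AutoConfig.from_pretrained("swift/Qwen3-30B-A3B-AWQ")
quant_cfg = getattr(config, "quantization_config", {}) or {}
if not isinstance(quant_cfg, dict):
    # some transformers versions wrap this in a config object
    quant_cfg = quant_cfg.to_dict()

print("quant_method:", quant_cfg.get("quant_method"))
print("bits:", quant_cfg.get("bits"))
print("group_size:", quant_cfg.get("group_size"))
# layers excluded from quantization (expected to include the MoE gate)
print("modules_to_not_convert:", quant_cfg.get("modules_to_not_convert"))
```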

If you have fine-tuned the model and wish to quantize the fine-tuned version, you can refer to the following quantization scripts:

- Dense Model Quantization Script: [View Here](https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize/awq.sh)
- MoE Model Quantization Script: [View Here](https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize/moe/awq.sh)

With these scripts, you can easily complete the quantization process for the model.


## Evaluation

We evaluate the quality of this AWQ quantization with [EvalScope](https://github.com/modelscope/evalscope). For best practices on evaluating Qwen3 models, refer to the following guides (a minimal usage sketch is shown after the links):
- [Best Practice (Chinese)](https://evalscope.readthedocs.io/zh-cn/latest/best_practice/qwen3.html)
- [Best Practice (English)](https://evalscope.readthedocs.io/en/latest/best_practice/qwen3.html)
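
The sketch below shows roughly how such an evaluation can be launched from EvalScope's Python entry point. The dataset identifiers and the `limit` value are illustrative assumptions; consult the guides above for the exact configuration behind the numbers reported here.

```python
# Rough EvalScope sketch; dataset names must match the registry of your
# evalscope version, so treat the ones below as placeholders.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model="swift/Qwen3-30B-A3B-AWQ",
    datasets=["math_500", "gpqa"],  # illustrative subset of the collection
    limit=10,                       # small smoke-test run, not the full benchmark
)

run_task(task_cfg=task_cfg)
```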

The performance of Qwen3-30B-A3B-AWQ is evaluated on our mixed benchmark, the [Qwen3 Evaluation Collection](https://modelscope.cn/datasets/modelscope/EvalScope-Qwen3-Test), with the results listed below:

> The performance comparison of Qwen3-30B-A3B-AWQ and Qwen3-30B-A3B

| task_type   | dataset_name    | metric                  | average_score(AWQ) | average_score(without AWQ) | count |
|-------------|-----------------|-------------------------|--------------------|----------------------------|-------|
| exam        | MMLU-Pro        | AverageAccuracy         | 0.7655             | 0.7828                     | 12032 |
| exam        | MMLU-Redux      | AverageAccuracy         | 0.8746             | 0.8872                     | 5700  |
| exam        | C-Eval          | AverageAccuracy         | 0.844              | 0.8722                     | 1346  |
| instruction | IFEval          | inst_level_strict_acc   | 0.8891             | 0.8925                     | 541   |
| instruction | IFEval          | inst_level_loose_acc    | 0.9107             | 0.9174                     | 541   |
| instruction | IFEval          | prompt_level_loose_acc  | 0.8651             | 0.8651                     | 541   |
| instruction | IFEval          | prompt_level_strict_acc | 0.8373             | 0.8318                     | 541   |
| math        | MATH-500        | AveragePass@1           | 0.944              | 0.938                      | 500   |
| knowledge   | GPQA            | AveragePass@1           | 0.596              | 0.601                      | 198   |
| code        | LiveCodeBench   | Pass@1                  | 0.5275             | 0.5549                     | 182   |
| exam        | iQuiz           | AverageAccuracy         | 0.6917             | 0.7417                     | 120   |
| math        | AIME 2024       | AveragePass@1           | 0.7333             | 0.8333                     | 30    |
| math        | AIME 2025       | AveragePass@1           | 0.7                | 0.7333                     | 30    |

> NOTE: For the pass@k metrics, considering the time cost of evaluation, we uniformly limit the number of generated responses per query to 1.

### Conclusion

As shown in the comparison above, the evaluation results across different tasks and datasets suggest that the AWQ-quantized version exhibits minimal fluctuation in model performance.

In fact, on most benchmarks the AWQ version performs on par with the original model, except for a few (such as AIME 2024 and iQuiz) where the performance degradation is relatively noticeable.