|
--- |
|
license: apache-2.0 |
|
tasks: |
|
- text-generation |
|
base_model: |
|
- Qwen/Qwen3-30B-A3B |
|
--- |
|
|
|
## Intro |
|
|
|
This AWQ version is quantized with [ms-swift](https://github.com/modelscope/ms-swift). For our best practice on training and fine-tuning Qwen3 models, see [here](https://github.com/modelscope/ms-swift/issues/4030).
|
|
|
Note that the AWQ versions of the Qwen3-MoE models are verified to work with Transformers and vLLM; we have not yet had the chance to test them on other inference engines. A minimal vLLM sketch is included at the end of the Inference section below.
|
|
|
## Inference |
|
|
|
```python |
|
import torch
from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name = "swift/Qwen3-30B-A3B-AWQ"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # switches between thinking and non-thinking modes; default is True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parse the thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
|
``` |
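
Since the checkpoint is also verified on vLLM, below is a minimal offline-inference sketch. It is not taken from the original card: the engine options (`quantization="awq"`, `max_model_len`) and the sampling settings are illustrative assumptions you may need to adjust, and `VLLM_USE_MODELSCOPE` is only needed if you want vLLM to download the checkpoint from ModelScope rather than the Hugging Face Hub.

```python
import os
os.environ["VLLM_USE_MODELSCOPE"] = "True"  # optional: pull the checkpoint from ModelScope

from vllm import LLM, SamplingParams

# load the AWQ checkpoint; vLLM can usually auto-detect AWQ, passing it explicitly is a safeguard
llm = LLM(model="swift/Qwen3-30B-A3B-AWQ", quantization="awq", max_model_len=8192)
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)

messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```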
|
|
|
## Quantization |
|
|
|
The model has undergone AWQ int4 quantization using the [ms-swift](https://github.com/modelscope/ms-swift) framework. Since the model is based on the MoE (Mixture of Experts) architecture, all `linear` layers except for `gate` and `lm_head` have been quantized. |
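
To double-check which modules were excluded, one can inspect the checkpoint's `quantization_config`. The following is a small verification sketch (not from the original card), assuming the config follows the standard Transformers AWQ layout:

```python
from modelscope import AutoConfig

# load only the config of the quantized checkpoint and print its quantization settings
config = AutoConfig.from_pretrained("swift/Qwen3-30B-A3B-AWQ")
print(config.quantization_config)
# expect 4-bit AWQ settings, with the un-quantized modules (e.g. gate / lm_head)
# listed under a field such as `modules_to_not_convert`; field names may vary
```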
|
|
|
If you have fine-tuned the model and wish to quantize the fine-tuned version, you can refer to the following quantization scripts: |
|
|
|
- Dense Model Quantization Script: [View Here](https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize/awq.sh) |
|
- MoE Model Quantization Script: [View Here](https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize/moe/awq.sh) |
|
|
|
With these scripts, you can easily complete the quantization process for the model. |
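
For reference, a hedged Python-level sketch of the same flow is shown below. The entry point and argument names mirror the `swift export` flags used in the linked scripts and are assumptions here; treat the scripts above as the source of truth for your ms-swift version.

```python
from swift.llm import ExportArguments, export_main

# quantize a base or fine-tuned checkpoint to AWQ int4 (argument names are assumptions
# mirroring the CLI flags in the linked awq.sh scripts; adjust to your ms-swift version)
export_main(ExportArguments(
    model="Qwen/Qwen3-30B-A3B",                          # or the path to your fine-tuned checkpoint
    quant_method="awq",
    quant_bits=4,
    dataset=["AI-ModelScope/alpaca-gpt4-data-en#256"],   # calibration data (illustrative)
    output_dir="Qwen3-30B-A3B-AWQ",
))
```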
|
|
|
|
|
## Evaluation |
|
|
|
We evaluate the quality of this AWQ quantization with [EvalScope](https://github.com/modelscope/evalscope). For best practices on evaluating Qwen3 models, refer to the following (a minimal usage sketch is given after the links):

- [Best Practice (Chinese)](https://evalscope.readthedocs.io/zh-cn/latest/best_practice/qwen3.html)

- [Best Practice (English)](https://evalscope.readthedocs.io/en/latest/best_practice/qwen3.html)
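
As a quick illustration, here is a minimal EvalScope sketch. It is not the exact setup used for the table below; the dataset name and sample limit are illustrative assumptions.

```python
from evalscope import TaskConfig, run_task

# run a small sanity-check evaluation on the AWQ checkpoint
task_cfg = TaskConfig(
    model="swift/Qwen3-30B-A3B-AWQ",
    datasets=["gsm8k"],   # illustrative; see the best-practice links for the full Qwen3 setup
    limit=10,             # evaluate only a few samples for a quick check
)
run_task(task_cfg=task_cfg)
```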
|
|
|
Qwen3-30B-A3B-AWQ is evaluated on our mixed benchmark, the [Qwen3 Evaluation Collection](https://modelscope.cn/datasets/modelscope/EvalScope-Qwen3-Test); the results are listed below:
|
|
|
> The performance comparison of Qwen3-30B-A3B-AWQ and Qwen3-30B-A3B |
|
|
|
| task_type | dataset_name | metric | average_score(AWQ) | average_score(without AWQ) | count | |
|
|-------------|-----------------|-------------------------|--------------------|----------------------------|-------| |
|
| exam | MMLU-Pro | AverageAccuracy | 0.7655 | 0.7828 | 12032 | |
|
| exam | MMLU-Redux | AverageAccuracy | 0.8746 | 0.8872 | 5700 | |
|
| exam | C-Eval | AverageAccuracy | 0.844 | 0.8722 | 1346 | |
|
| instruction | IFEval | inst_level_strict_acc | 0.8891 | 0.8925 | 541 | |
|
| instruction | IFEval | inst_level_loose_acc | 0.9107 | 0.9174 | 541 | |
|
| instruction | IFEval | prompt_level_loose_acc | 0.8651 | 0.8651 | 541 | |
|
| instruction | IFEval | prompt_level_strict_acc | 0.8373 | 0.8318 | 541 | |
|
| math | MATH-500 | AveragePass@1 | 0.944 | 0.938 | 500 | |
|
| knowledge | GPQA | AveragePass@1 | 0.596 | 0.601 | 198 | |
|
| code | LiveCodeBench | Pass@1 | 0.5275 | 0.5549 | 182 | |
|
| exam | iQuiz | AverageAccuracy | 0.6917 | 0.7417 | 120 | |
|
| math | AIME 2024 | AveragePass@1 | 0.7333 | 0.8333 | 30 | |
|
| math | AIME 2025 | AveragePass@1 | 0.7 | 0.7333 | 30 | |
|
|
|
> NOTE: For pass@k metrics, to keep evaluation time manageable, we uniformly limit the number of generated responses to 1.
|
|
|
### Conclusion |
|
|
|
As the comparison above shows, evaluation results across different tasks and datasets suggest that the AWQ-quantized version exhibits minimal fluctuation in model performance.
|
|
|
In fact, on most benchmarks the AWQ version performs on par with the original model, except for a few (such as AIME 2024 and iQuiz) where the performance degradation is more noticeable.
|
|
|
|