---
license: apache-2.0
tasks:
- text-generation
base_model:
- Qwen/Qwen3-30B-A3B
---

## Intro

The AWQ version is quantized using [ms-swift](https://github.com/modelscope/ms-swift). You may refer to our best practice for training/fine-tuning Qwen3 models [here](https://github.com/modelscope/ms-swift/issues/4030).

Note that the AWQ versions of Qwen3-MoE models are verified to work with Transformers/vLLM. We have not yet had the chance to test them on other engines.

## Inference

```python
import torch
from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name = "swift/Qwen3-30B-A3B-AWQ"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parse thinking content
try:
    # rindex finding 151668 (the token id of </think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```
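As noted in the intro, the AWQ weights are also verified with vLLM. Below is a minimal sketch of offline inference with vLLM's Python API; the sampling values and `max_model_len` are illustrative assumptions rather than tuned recommendations, and you may need to set `VLLM_USE_MODELSCOPE=1` so vLLM resolves the model ID from ModelScope.

```python
# Minimal vLLM sketch (assumes a vLLM build with AWQ + Qwen3-MoE support).
# Set VLLM_USE_MODELSCOPE=1 if the weights should be pulled from ModelScope.
from vllm import LLM, SamplingParams

llm = LLM(model="swift/Qwen3-30B-A3B-AWQ", max_model_len=8192)

# illustrative sampling settings; adjust to your use case
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)

messages = [
    {"role": "user", "content": "Give me a short introduction to large language models."}
]

# llm.chat applies the model's chat template before generation
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```

Alternatively, `vllm serve swift/Qwen3-30B-A3B-AWQ` exposes the same weights behind an OpenAI-compatible API.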
## Quantization

The model has undergone AWQ int4 quantization using the [ms-swift](https://github.com/modelscope/ms-swift) framework. Since the model is based on the MoE (Mixture of Experts) architecture, all `linear` layers except for `gate` and `lm_head` have been quantized.

If you have fine-tuned the model and wish to quantize the fine-tuned version, you can refer to the following quantization scripts:

- Dense model quantization script: [View Here](https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize/awq.sh)
- MoE model quantization script: [View Here](https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize/moe/awq.sh)

With these scripts, you can easily complete the quantization process for the model.

## Evaluation

We evaluate the quality of this AWQ quantization with [EvalScope](https://github.com/modelscope/evalscope). For best practices on evaluating Qwen3 models, refer to the following:

- [Best Practice (Chinese)](https://evalscope.readthedocs.io/zh-cn/latest/best_practice/qwen3.html)
- [Best Practice (English)](https://evalscope.readthedocs.io/en/latest/best_practice/qwen3.html)

Performance of Qwen3-30B-A3B-AWQ is evaluated on our mixed benchmark, the [Qwen3 Evaluation Collection](https://modelscope.cn/datasets/modelscope/EvalScope-Qwen3-Test), with the results listed below:

> Performance comparison of Qwen3-30B-A3B-AWQ and Qwen3-30B-A3B

| task_type   | dataset_name  | metric                  | average_score (AWQ) | average_score (without AWQ) | count |
|-------------|---------------|-------------------------|---------------------|-----------------------------|-------|
| exam        | MMLU-Pro      | AverageAccuracy         | 0.7655              | 0.7828                      | 12032 |
| exam        | MMLU-Redux    | AverageAccuracy         | 0.8746              | 0.8872                      | 5700  |
| exam        | C-Eval        | AverageAccuracy         | 0.844               | 0.8722                      | 1346  |
| instruction | IFEval        | inst_level_strict_acc   | 0.8891              | 0.8925                      | 541   |
| instruction | IFEval        | inst_level_loose_acc    | 0.9107              | 0.9174                      | 541   |
| instruction | IFEval        | prompt_level_loose_acc  | 0.8651              | 0.8651                      | 541   |
| instruction | IFEval        | prompt_level_strict_acc | 0.8373              | 0.8318                      | 541   |
| math        | MATH-500      | AveragePass@1           | 0.944               | 0.938                       | 500   |
| knowledge   | GPQA          | AveragePass@1           | 0.596               | 0.601                       | 198   |
| code        | LiveCodeBench | Pass@1                  | 0.5275              | 0.5549                      | 182   |
| exam        | iQuiz         | AverageAccuracy         | 0.6917              | 0.7417                      | 120   |
| math        | AIME 2024     | AveragePass@1           | 0.7333              | 0.8333                      | 30    |
| math        | AIME 2025     | AveragePass@1           | 0.7                 | 0.7333                      | 30    |

> NOTE: For the pass@k metric, considering the time cost of evaluation, we uniformly limit the number of generated responses to 1.

### Conclusion

As the comparison above shows, evaluation results across different tasks and datasets suggest that the AWQ-quantized version exhibits minimal fluctuation in model performance. For most benchmarks, the AWQ version performs on par with the original model, except for a few benchmarks (such as AIME 2024 and iQuiz) where the performance degradation is relatively noticeable.