Qwen3-32B-AWQ-GEMM-lc

Original Model: https://huggingface.co/Qwen/Qwen3-32B

Created with the latest AutoAWQ. Calibration was done on long-context data with 128 samples, using the code below.

Quantization quality

Pre- and post-quantization performance was tested with lm_eval (https://github.com/EleutherAI/lm-evaluation-harness) using this command:

lm_eval --model local-completions --tasks gsm8k \
    --model_args model=Qwen/<model>,base_url=http://127.0.0.1:11435/v1/completions,max_length=32768 \
    --num_fewshot 5

yields the following results:

Model       Filter            Metric          Value
original    flexible-extract  exact_match ↑   0.6232 ± 0.0133
original    strict-match      exact_match ↑   0.7415 ± 0.0121
Qwen's AWQ  flexible-extract  exact_match ↑   failed
Qwen's AWQ  strict-match      exact_match ↑   failed
w4a16(sc)   flexible-extract  exact_match ↑   0.6490 ± 0.0131
w4a16(sc)   strict-match      exact_match ↑   0.6672 ± 0.0130
w4a16(lc)   flexible-extract  exact_match ↑   0.7142 ± 0.0124
w4a16(lc)   strict-match      exact_match ↑   0.7839 ± 0.0113

(sc/lc: short- vs. long-context calibration; this model is the lc variant.)
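The base_url in the lm_eval command assumes an OpenAI-compatible completions server is already listening on port 11435. As a hedged example, one way to provide that endpoint is vLLM (the exact flags may differ across versions):

```shell
# Sketch: serve the model behind an OpenAI-compatible API on the port
# used by the lm_eval command above. Substitute the model path being tested.
vllm serve Qwen/Qwen3-32B --port 11435 --max-model-len 32768
```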

Quantization details

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
from datasets import load_dataset

def load_cosmopedia():
    # Long-context calibration data: keep only samples with >= 2048 tokens.
    data = load_dataset("HuggingFaceTB/cosmopedia-100k", split="train")
    data = data.filter(lambda x: x["text_token_length"] >= 2048)
    return list(data["text"])

model_path = "Qwen/Qwen3-32B"
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

quant_path = "./Qwen/Qwen3-32B-AWQ-4bit"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data=load_cosmopedia(),
    n_parallel_calib_samples=1,   # process one calibration sample at a time to limit memory use
    max_calib_samples=128,
    max_calib_seq_len=40960,      # long-context calibration
)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
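For intuition, the quant_config above ("zero_point": True, "q_group_size": 128, "w_bit": 4) corresponds to asymmetric 4-bit quantization applied per group of 128 weights. The following NumPy sketch illustrates that scheme on a single group; it is an illustration only, not AutoAWQ's actual implementation (which also applies activation-aware scaling before quantizing):

```python
import numpy as np

def quantize_group(w, n_bit=4):
    # Asymmetric (zero-point) quantization of one weight group:
    # map [w.min(), w.max()] onto the integer range [0, 2**n_bit - 1].
    qmax = 2 ** n_bit - 1
    scale = (w.max() - w.min()) / qmax
    zero = np.round(-w.min() / scale)
    q = np.clip(np.round(w / scale) + zero, 0, qmax)
    return q.astype(np.uint8), scale, zero

def dequantize_group(q, scale, zero):
    # Recover approximate float weights from the stored integers.
    return (q.astype(np.float32) - zero) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=128).astype(np.float32)   # one group of 128 weights
q, scale, zero = quantize_group(w)
w_hat = dequantize_group(q, scale, zero)
err = float(np.abs(w - w_hat).max())          # worst-case reconstruction error
```

The zero point lets the integer grid cover an asymmetric weight range, which is why "zero_point": True generally reconstructs weights more accurately than symmetric quantization at the same bit width.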