# Qwen3-32B-AWQ-GEMM-sc
Original Model: https://huggingface.co/Qwen/Qwen3-32B
Created with the latest AutoAWQ. Calibration used a short context and 64 samples, with the code below.
## Quantization quality
Testing pre-/post-quantization with [lm_eval](https://github.com/EleutherAI/lm-evaluation-harness) using this command:

```shell
VLLM_WORKER_MULTIPROC_METHOD=spawn lm_eval --model vllm \
    --model_args pretrained="<model>",add_bos_token=true,tensor_parallel_size=4 \
    --tasks gsm8k \
    --num_fewshot 5 \
    --limit 250 \
    --batch_size 'auto'
```
yields the following results:

| Model | Filter | Metric | | Value |
|---|---|---|---|---|
| original | flexible-extract | exact_match | ↑ | 0.652 ± 0.0302 |
| original | strict-match | exact_match | ↑ | 0.748 ± 0.0275 |
| w4a16 | flexible-extract | exact_match | ↑ | 0.660 ± 0.0300 |
| w4a16 | strict-match | exact_match | ↑ | 0.688 ± 0.0294 |
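As a quick consistency check (not part of the original card), the reported standard errors match the binomial formula sqrt(p·(1−p)/n) with n = 250, i.e. the `--limit 250` subsample:

```python
import math

# Binomial standard error sqrt(p*(1-p)/n) for the --limit 250 subsample.
# (p, reported_stderr) pairs are taken from the table above.
n = 250
for p, reported in [(0.652, 0.0302), (0.748, 0.0275), (0.660, 0.0300), (0.688, 0.0294)]:
    se = math.sqrt(p * (1 - p) / n)
    print(f"p={p:.3f}  se={se:.4f}  reported={reported:.4f}")
    assert abs(se - reported) < 5e-4  # agrees to rounding
```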
## Quantization details
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = '/mnt/lcache/sglang/models/Qwen/Qwen3-32B'
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# 4-bit weights, group size 128, GEMM kernel layout
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)
# AWQ: 100%|████████████████████████████████████| 64/64 [1:09:36<00:00, 65.26s/it]

quant_path = './Qwen3-32B-AWQ-4bit-GEMM-sc'
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```
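For a rough idea of the compression this config buys, a w4 GEMM quant with group size 128 stores, per group of 128 weights, an fp16 scale plus a packed 4-bit zero point on top of the 4-bit weights. The sketch below estimates the effective bits per weight under that assumed layout (it ignores embeddings, norms, and file overhead):

```python
# Effective bits per weight for w_bit=4, q_group_size=128 (assumed layout:
# one fp16 scale and one packed 4-bit zero point per group of 128 weights).
W_BITS = 4
GROUP_SIZE = 128
SCALE_BITS = 16  # fp16 scale per group
ZERO_BITS = 4    # packed 4-bit zero point per group

bits_per_weight = W_BITS + (SCALE_BITS + ZERO_BITS) / GROUP_SIZE
print(f"{bits_per_weight:.4f} bits/weight")  # ~4.156, vs 16 for fp16
```

So the quantized weights come to roughly a quarter of their fp16 size, which is why a 32B model fits on far less GPU memory after AWQ.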
## Final notes
The quant appears significantly degraded: strict-match exact_match drops from 0.748 to 0.688. I'm trying one more quantization with 128 samples, a different dataset (HuggingFaceTB/cosmopedia-100k), and a longer max sequence length (40960). It will be ready in a few hours, and I'll upload it here: https://huggingface.co/kmouratidis/Qwen3-32B-AWQ-GEMM-lc