# Qwen3-32B-AWQ-GEMM-sc

Original Model: https://huggingface.co/Qwen/Qwen3-32B

Created with the latest AutoAWQ. Calibration was done with 64 samples at short context, using the code below.

## Quantization quality

Testing pre- and post-quantization with lm_eval (https://github.com/EleutherAI/lm-evaluation-harness) using this command:

```bash
VLLM_WORKER_MULTIPROC_METHOD=spawn lm_eval --model vllm \
  --model_args pretrained="<model>",add_bos_token=true,tensor_parallel_size=4 \
  --tasks gsm8k \
  --num_fewshot 5 \
  --limit 250 \
  --batch_size 'auto'
```

yields the following results:

| Model    | Filter           | Metric        | Value | Stderr   |
|----------|------------------|---------------|-------|----------|
| original | flexible-extract | exact_match ↑ | 0.652 | ± 0.0302 |
| original | strict-match     | exact_match ↑ | 0.748 | ± 0.0275 |
| w4a16    | flexible-extract | exact_match ↑ | 0.660 | ± 0.0300 |
| w4a16    | strict-match     | exact_match ↑ | 0.688 | ± 0.0294 |

## Quantization details

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = '/mnt/lcache/sglang/models/Qwen/Qwen3-32B'

# Load the full-precision model and its tokenizer.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# 4-bit weights, group size 128, zero-point quantization, GEMM kernel.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)
# AWQ: 100%|████████████████████| 64/64 [1:09:36<00:00, 65.26s/it]

# Save the quantized weights alongside the tokenizer.
quant_path = './Qwen3-32B-AWQ-4bit-GEMM-sc'
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```
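
Not part of the run above, but for completeness, a minimal sketch of loading the saved checkpoint with vLLM (the same backend used for the evaluation). The prompt and `tensor_parallel_size` are illustrative, not prescriptive:

```python
from vllm import LLM, SamplingParams

# Load the saved AWQ checkpoint; tensor_parallel_size mirrors the eval setup.
llm = LLM(model="./Qwen3-32B-AWQ-4bit-GEMM-sc",
          quantization="awq",
          tensor_parallel_size=4)

outputs = llm.generate(["Briefly explain what AWQ quantization does."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```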

## Final notes

Judging by the strict-match score (0.748 → 0.688), the quant appears to be significantly degraded. I'm trying one more quantization with 128 samples, a different dataset (HuggingFaceTB/cosmopedia-100k), and a longer max sequence length (40960); a sketch of the planned call is below. It will be ready in a few hours, and I'll upload it here: https://huggingface.co/kmouratidis/Qwen3-32B-AWQ-GEMM-lc
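
Roughly, the follow-up plan maps onto AutoAWQ's `quantize()` arguments as in the sketch below. This is my intent, not the final script; the `calib_data`, `max_calib_samples`, and `max_calib_seq_len` parameter names assume a recent AutoAWQ release.

```python
# Sketch of the planned re-quantization (not yet run or verified).
model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data="HuggingFaceTB/cosmopedia-100k",  # different calibration dataset
    max_calib_samples=128,                       # up from 64
    max_calib_seq_len=40960,                     # much longer context
)
```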
