# Qwen3-32B-AWQ-GEMM-sc
Original Model: https://huggingface.co/Qwen/Qwen3-32B
Created with the latest AutoAWQ. Calibration used a short context and 64 samples, with the code below.
## Quantization quality
Testing pre-/post-quantization with [lm_eval](https://github.com/EleutherAI/lm-evaluation-harness) using this command:

```shell
VLLM_WORKER_MULTIPROC_METHOD=spawn lm_eval --model vllm \
    --model_args pretrained="<model>",add_bos_token=true,tensor_parallel_size=4 \
    --tasks gsm8k \
    --num_fewshot 5 \
    --limit 250 \
    --batch_size 'auto'
```
yields the following results:

| Model | Filter | Metric | | Value |
|---|---|---|---|---|
| original | flexible-extract | exact_match | ↑ | 0.652 ± 0.0302 |
| original | strict-match | exact_match | ↑ | 0.748 ± 0.0275 |
| w4a16 | flexible-extract | exact_match | ↑ | 0.660 ± 0.0300 |
| w4a16 | strict-match | exact_match | ↑ | 0.688 ± 0.0294 |
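As a quick consistency check (not part of the original card), the reported standard errors match the binomial formula sqrt(p·(1−p)/n) with n = 250, i.e. the `--limit 250` subsample:

```python
import math

# Binomial standard error sqrt(p*(1-p)/n) for the --limit 250 subsample.
# (p, reported_stderr) pairs are taken from the table above.
n = 250
for p, reported in [(0.652, 0.0302), (0.748, 0.0275), (0.660, 0.0300), (0.688, 0.0294)]:
    se = math.sqrt(p * (1 - p) / n)
    print(f"p={p:.3f}  se={se:.4f}  reported={reported:.4f}")
    assert abs(se - reported) < 5e-4  # agrees to rounding
```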
## Quantization details
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = '/mnt/lcache/sglang/models/Qwen/Qwen3-32B'
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# 4-bit weights, group size 128, GEMM kernel layout
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)
# AWQ: 100%|████████████████████████████████████| 64/64 [1:09:36<00:00, 65.26s/it]

quant_path = './Qwen3-32B-AWQ-4bit-GEMM-sc'
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```
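For a rough idea of the compression this config buys, a w4 GEMM quant with group size 128 stores, per group of 128 weights, an fp16 scale plus a packed 4-bit zero point on top of the 4-bit weights. The sketch below estimates the effective bits per weight under that assumed layout (it ignores embeddings, norms, and file overhead):

```python
# Effective bits per weight for w_bit=4, q_group_size=128 (assumed layout:
# one fp16 scale and one packed 4-bit zero point per group of 128 weights).
W_BITS = 4
GROUP_SIZE = 128
SCALE_BITS = 16  # fp16 scale per group
ZERO_BITS = 4    # packed 4-bit zero point per group

bits_per_weight = W_BITS + (SCALE_BITS + ZERO_BITS) / GROUP_SIZE
print(f"{bits_per_weight:.4f} bits/weight")  # ~4.156, vs 16 for fp16
```

So the quantized weights come to roughly a quarter of their fp16 size, which is why a 32B model fits on far less GPU memory after AWQ.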
## Final notes
The quant appears significantly degraded: strict-match exact_match drops from 0.748 to 0.688. I'm trying one more quantization with 128 samples, a different dataset (HuggingFaceTB/cosmopedia-100k), and a longer max sequence length (40960). It will be ready in a few hours, and I'll upload it here: https://huggingface.co/kmouratidis/Qwen3-32B-AWQ-GEMM-lc