This is the AWQ (4-bit) quantized model of Qwen3-1.7B, created with AutoAWQ using the following config. It features very low GPU memory usage, high throughput, and very fast responses.

{
  "zero_point": True,
  "q_group_size": 32,
  "w_bit": 4,
  "version": "GEMM"
}

P.S. q_group_size is set to 32 for higher accuracy than the default value (128).
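
For reference, here is a minimal sketch of how a model like this can be produced with AutoAWQ; the paths and the use of AutoAWQ's default calibration data are assumptions, not the exact script behind this repo.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen3-1.7B"          # base model
quant_path = "Qwen3-1.7B-AWQ-Group32"   # output directory (hypothetical name)

quant_config = {
    "zero_point": True,
    "q_group_size": 32,  # smaller groups -> finer-grained scales, higher accuracy
    "w_bit": 4,
    "version": "GEMM",
}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Calibrate and quantize with AutoAWQ's default calibration dataset
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)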

LLM Serving

This model has been tested with both vllm and lmdeploy, but there are a few notes to keep in mind (usage examples for both follow the list below).

  • Special note for vllm: to make Qwen3 models work with vllm 0.9.1, you need to downgrade the triton lib to 3.2.0, otherwise vllm will shut down with an error during inference:
pip install -U triton==3.2.0
  • For lmdeploy, it is recommended to install it from git instead of the official pip repo, as version 0.8.0 on pip has problems working with the latest transformers lib:
pip install git+https://github.com/InternLM/lmdeploy
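
With triton pinned, a minimal offline-inference sketch for vllm might look like the following; the prompt and sampling parameters are illustrative assumptions, not recommended settings.

from vllm import LLM, SamplingParams

# Load the AWQ checkpoint; passing quantization="awq" makes the format explicit
llm = LLM(model="flin775/Qwen3-1.7B-AWQ-Group32", quantization="awq")

params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
outputs = llm.generate(["Give me a short introduction to large language models."], params)
print(outputs[0].outputs[0].text)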

lmdeploy is preferred for its quicker startup, faster inference, and lower GPU memory usage.
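
A matching sketch with lmdeploy's Python pipeline API, assuming the turbomind backend with an explicit AWQ model format (again illustrative, not the only way to launch it):

from lmdeploy import pipeline, TurbomindEngineConfig

# model_format="awq" tells the turbomind backend to load the 4-bit AWQ weights
pipe = pipeline(
    "flin775/Qwen3-1.7B-AWQ-Group32",
    backend_config=TurbomindEngineConfig(model_format="awq"),
)
response = pipe(["Hi, please introduce yourself."])
print(response)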
