This is the AWQ (4-bit) quantized model of Qwen3-1.7B, created with AutoAWQ using the following config. It features very low GPU memory usage, high throughput, and very fast response times:
```python
{
    "zero_point": True,
    "q_group_size": 32,
    "w_bit": 4,
    "version": "GEMM"
}
```
P.S. `q_group_size` is set to 32 for higher accuracy than the default value (128).
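For reference, here is a minimal sketch of how a model like this can be produced with AutoAWQ. The source and output paths are placeholders, and calibration relies on AutoAWQ's defaults:

```python
# Minimal AutoAWQ quantization sketch; paths are placeholders and
# calibration uses AutoAWQ's default dataset.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen3-1.7B"   # source model
quant_path = "Qwen3-1.7B-AWQ"    # output directory

quant_config = {
    "zero_point": True,   # asymmetric quantization with per-group zero points
    "q_group_size": 32,   # smaller groups than the default 128, for higher accuracy
    "w_bit": 4,           # 4-bit weights
    "version": "GEMM",    # GEMM kernel layout
}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run AWQ calibration and quantize the weights in place
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```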
## LLM Serving
This model has been tested and works well with both vLLM and LMDeploy, but there are a few notes to keep in mind.
- Special note for vLLM: to make Qwen3 models work with vLLM 0.9.1, you need to downgrade triton to 3.2.0, otherwise vLLM will shut down with an error during inference (see the vLLM sketch after this list):

  ```bash
  pip install -U triton==3.2.0
  ```
- For LMDeploy, it is recommended to install it from git instead of the official pip repo, as version 0.8.0 on PyPI has problems working with the latest transformers library:

  ```bash
  pip install git+https://github.com/InternLM/lmdeploy
  ```
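As referenced above, here is a minimal offline-inference sketch for vLLM. The model path `Qwen3-1.7B-AWQ` is a placeholder for wherever the quantized weights are stored:

```python
# Minimal vLLM offline-inference sketch; the model path is a placeholder.
from vllm import LLM, SamplingParams

# quantization="awq" makes vLLM load its AWQ kernels for the 4-bit weights
llm = LLM(model="Qwen3-1.7B-AWQ", quantization="awq")
params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

outputs = llm.generate(["Give me a short introduction to large language models."], params)
print(outputs[0].outputs[0].text)
```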
LMDeploy is preferred for its quick startup, fast inference, and lower GPU memory usage.
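A matching LMDeploy sketch using its high-level pipeline API, with the same placeholder model path:

```python
# Minimal LMDeploy pipeline sketch; the model path is a placeholder.
from lmdeploy import pipeline

pipe = pipeline("Qwen3-1.7B-AWQ")
responses = pipe(["Give me a short introduction to large language models."])
print(responses[0].text)
```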