This is the AWQ (4-bit) quantized model of Qwen3-1.7B, created with AutoAWQ using the following config. It features very low GPU memory usage, high throughput, and very fast response times:
```python
{
    "zero_point": True,
    "q_group_size": 32,
    "w_bit": 4,
    "version": "GEMM"
}
```
P.S. `q_group_size` is set to 32 for higher accuracy than the default value (128).
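For reference, here is a minimal sketch of how a model like this can be produced with AutoAWQ. The source and output paths are placeholders, and calibration relies on AutoAWQ's defaults:

```python
# Minimal AutoAWQ quantization sketch; paths are placeholders and
# calibration uses AutoAWQ's default dataset.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen3-1.7B"   # source model
quant_path = "Qwen3-1.7B-AWQ"    # output directory

quant_config = {
    "zero_point": True,   # asymmetric quantization with per-group zero points
    "q_group_size": 32,   # smaller groups than the default 128, for higher accuracy
    "w_bit": 4,           # 4-bit weights
    "version": "GEMM",    # GEMM kernel layout
}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run AWQ calibration and quantize the weights in place
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```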
## LLM Serving
This model has been tested and works well with both vLLM and LMDeploy, but there are a few notes to keep in mind.
- Special note for vLLM: to make Qwen3 models work with vLLM 0.9.1, you need to downgrade triton to 3.2.0, otherwise vLLM will shut down with an error during inference (see the vLLM sketch after this list):

  ```bash
  pip install -U triton==3.2.0
  ```
- For LMDeploy, it is recommended to install it from git instead of the official pip repo, as version 0.8.0 on PyPI has problems working with the latest transformers library:

  ```bash
  pip install git+https://github.com/InternLM/lmdeploy
  ```
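As referenced above, here is a minimal offline-inference sketch for vLLM. The model path `Qwen3-1.7B-AWQ` is a placeholder for wherever the quantized weights are stored:

```python
# Minimal vLLM offline-inference sketch; the model path is a placeholder.
from vllm import LLM, SamplingParams

# quantization="awq" makes vLLM load its AWQ kernels for the 4-bit weights
llm = LLM(model="Qwen3-1.7B-AWQ", quantization="awq")
params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

outputs = llm.generate(["Give me a short introduction to large language models."], params)
print(outputs[0].outputs[0].text)
```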
LMDeploy is preferred for its quick startup, fast inference, and lower GPU memory usage.
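A matching LMDeploy sketch using its high-level pipeline API, with the same placeholder model path:

```python
# Minimal LMDeploy pipeline sketch; the model path is a placeholder.
from lmdeploy import pipeline

pipe = pipeline("Qwen3-1.7B-AWQ")
responses = pipe(["Give me a short introduction to large language models."])
print(responses[0].text)
```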