AWQ version

#8
by celsowm - opened

Please release an AWQ version.

Thanks!

The AWQ quant tools do not support vision models yet AFAIK.

I tried the latest llm-compressor (as AutoAWQ has been adopted by the vLLM project), but their newest example, which uses GPTQ as an alternative, failed for me due to OOM (even with 256 GB RAM, not VRAM):

https://github.com/vllm-project/llm-compressor/blob/main/examples/multimodal_vision/mistral3_example.py

Support by the Mistral AI team on llm-compressor would be nice.
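
For anyone curious, that example boils down to roughly the following (a sketch reconstructed from the llm-compressor docs, not the exact script; the ignore patterns, calibration dataset and settings here are my assumptions):

```python
# Rough sketch of a W4A16 GPTQ run with llm-compressor
# (not the exact mistral3_example.py -- see the link above for that).
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "mistralai/Mistral-Small-3.2-24B-Instruct-2506"

# Quantize only the language-model Linear layers; keep lm_head and the
# vision parts in full precision (the regex module names are assumptions).
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
)

# The calibration pass is where I ran out of host RAM.
oneshot(
    model=MODEL_ID,
    dataset="open_platypus",  # text-only calibration set, just for the sketch
    recipe=recipe,
    output_dir="Mistral-Small-3.2-24B-Instruct-2506-W4A16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```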

Well, the experimental script for creating an FP8 quant did work.

For those who are interested, give stelterlab/Mistral-Small-3.2-24B-Instruct-2506-FP8 a try.
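
The FP8 path is much lighter, since the dynamic FP8 scheme needs no calibration data. Roughly (again a sketch, not the exact script; the ignore list is an assumption):

```python
# Sketch of an FP8-dynamic quantization with llm-compressor
# (FP8 weights, dynamic per-token activations, no calibration set needed).
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "mistralai/Mistral-Small-3.2-24B-Instruct-2506"

recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
)

oneshot(
    model=MODEL_ID,
    recipe=recipe,
    output_dir="Mistral-Small-3.2-24B-Instruct-2506-FP8",
)
```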

vLLM came up with some errors and warnings, but it seems to work (using v0.9.1 on an L40, reducing the max model length and the max image count, and using an fp8 KV cache):

INFO 06-25 18:15:13 [worker.py:294] Memory profiling takes 6.48 seconds
INFO 06-25 18:15:13 [worker.py:294] the current vLLM instance can use total_gpu_memory (44.39GiB) x gpu_memory_utilization (0.98) = 43.50GiB
INFO 06-25 18:15:13 [worker.py:294] model weights take 24.05GiB; non_torch_memory takes 0.28GiB; PyTorch activation peak memory takes 3.65GiB; the rest of the memory reserved for KV Cache is 15.52GiB.
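
The 15.52 GiB in the last line is simply the profiled budget minus everything else:

```python
# KV cache budget as derived from the profiler output above
budget = 44.39 * 0.98                      # total_gpu_memory x utilization ~= 43.50 GiB
kv_cache = budget - 24.05 - 0.28 - 3.65    # minus weights, non-torch, activation peak
print(round(kv_cache, 2))                  # -> 15.52 GiB
```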

Would you mind sharing your vLLM params?

vllm serve Mistral-Small-3.2-24B-Instruct-2506 --tokenizer-mode mistral --config-format mistral --load-format mistral --tool-call-parser mistral --enable-auto-tool-choice --port 8101 --gpu-memory-utilization 0.98 --max-model-len 16384 --limit_mm_per_prompt 'image=2' --kv-cache-dtype fp8

This did work for me.
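
In case it helps, this is a minimal way to hit that server through vLLM's OpenAI-compatible API (the prompt and image URL below are placeholders):

```python
# Send text + one image to the vLLM server started above (port 8101).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8101/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Mistral-Small-3.2-24B-Instruct-2506",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            # placeholder URL -- replace with a real image
            {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```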

How do you suppose the 'OPEA/Mistral-Small-3.1-24B-Instruct-2503-int4-AutoRound-awq-sym' model works, then?

I've been using this quantized version for some time and it's been working great with vLLM.

Well, their model card states that they used Intel's auto-round toolkit: https://github.com/intel/auto-round

I wasn't aware that they also support CUDA as a platform. I had the impression it was Intel CPU/NPU only.

I will give it a try at the weekend. Is the OPEA quant text-only, or does it also support image-text-to-text?
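
For reference, the basic flow from their README looks roughly like this, and I'd guess OPEA's 3.1 quant was produced along these lines (a sketch; the export format string and whether plain AutoModelForCausalLM is enough for the multimodal checkpoint are assumptions on my part):

```python
# Basic auto-round flow (text-LLM path) as shown in the Intel auto-round README.
# For the multimodal checkpoints the loading side probably needs auto-round's
# vision-language path instead of plain AutoModelForCausalLM.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

MODEL_ID = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"  # the model OPEA quantized

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# int4, group size 128, symmetric -- matching the "int4-AutoRound-awq-sym" name
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()

# export in an AWQ-compatible format so vLLM can serve it with its AWQ kernels
autoround.save_quantized("Mistral-Small-3.1-24B-int4-awq-sym", format="auto_awq")
```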

It recognizes images as well. It would be great if you could do this!

Well, it seems that at least the current version of auto-round is not yet ready for this Mistral version:

KeyError: <class 'transformers.models.mistral3.configuration_mistral3.Mistral3Config'>

I will have to take a deeper look into it and/or ask the OPEA team what they did for v3.1.

unsloth/Mistral-Small-3.2-24B-Instruct-2506 seems to load.

I've only seen bnb and GGUF quants from unsloth.
