AWQ version

#8
by celsowm - opened

Please release an AWQ version.

Thanks!

The AWQ quant tools do not support vision models yet AFAIK.

I tried the latest llm-compressor (as AutoAWQ has been adopted by the vLLM project), but their newest example, which uses GPTQ as an alternative, failed for me due to OOM (even with 256 GB RAM, not VRAM):

https://github.com/vllm-project/llm-compressor/blob/main/examples/multimodal_vision/mistral3_example.py

Support by the Mistral AI team on llm-compressor would be nice.
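
For anyone curious, that example boils down to roughly the following (a sketch reconstructed from the llm-compressor docs, not the exact script; the ignore patterns, calibration dataset and settings here are my assumptions):

```python
# Rough sketch of a W4A16 GPTQ run with llm-compressor
# (not the exact mistral3_example.py -- see the link above for that).
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "mistralai/Mistral-Small-3.2-24B-Instruct-2506"

# Quantize only the language-model Linear layers; keep lm_head and the
# vision parts in full precision (the regex module names are assumptions).
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
)

# The calibration pass is where I ran out of host RAM.
oneshot(
    model=MODEL_ID,
    dataset="open_platypus",  # text-only calibration set, just for the sketch
    recipe=recipe,
    output_dir="Mistral-Small-3.2-24B-Instruct-2506-W4A16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```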

Well, the experimental script for creating an FP8 quant did work.

For those who are interested, give stelterlab/Mistral-Small-3.2-24B-Instruct-2506-FP8 a try.
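
The FP8 path is much lighter, since the dynamic FP8 scheme needs no calibration data. Roughly (again a sketch, not the exact script; the ignore list is an assumption):

```python
# Sketch of an FP8-dynamic quantization with llm-compressor
# (FP8 weights, dynamic per-token activations, no calibration set needed).
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "mistralai/Mistral-Small-3.2-24B-Instruct-2506"

recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
)

oneshot(
    model=MODEL_ID,
    recipe=recipe,
    output_dir="Mistral-Small-3.2-24B-Instruct-2506-FP8",
)
```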

vLLM came up with some errors and warnings, but it seems to work (using v0.9.1 on an L40, reducing the max model length and the max image count, and using an fp8 KV cache):

INFO 06-25 18:15:13 [worker.py:294] Memory profiling takes 6.48 seconds
INFO 06-25 18:15:13 [worker.py:294] the current vLLM instance can use total_gpu_memory (44.39GiB) x gpu_memory_utilization (0.98) = 43.50GiB
INFO 06-25 18:15:13 [worker.py:294] model weights take 24.05GiB; non_torch_memory takes 0.28GiB; PyTorch activation peak memory takes 3.65GiB; the rest of the memory reserved for KV Cache is 15.52GiB.
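
The 15.52 GiB in the last line is simply the profiled budget minus everything else:

```python
# KV cache budget as derived from the profiler output above
budget = 44.39 * 0.98                      # total_gpu_memory x utilization ~= 43.50 GiB
kv_cache = budget - 24.05 - 0.28 - 3.65    # minus weights, non-torch, activation peak
print(round(kv_cache, 2))                  # -> 15.52 GiB
```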

Would you mind sharing your vLLM params?

vllm serve Mistral-Small-3.2-24B-Instruct-2506 --tokenizer-mode mistral --config-format mistral --load-format mistral --tool-call-parser mistral --enable-auto-tool-choice --port 8101 --gpu-memory-utilization 0.98 --max-model-len 16384 --limit_mm_per_prompt 'image=2' --kv-cache-dtype fp8

This did work for me.
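
In case it helps, this is a minimal way to hit that server through vLLM's OpenAI-compatible API (the prompt and image URL below are placeholders):

```python
# Send text + one image to the vLLM server started above (port 8101).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8101/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Mistral-Small-3.2-24B-Instruct-2506",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            # placeholder URL -- replace with a real image
            {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```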

How do you suppose the 'OPEA/Mistral-Small-3.1-24B-Instruct-2503-int4-AutoRound-awq-sym' model works, then?

I've been using this quantized version for some time and it's been working great with vLLM.

Well, their model card states that they used Intel's auto-round toolkit: https://github.com/intel/auto-round

I wasn't aware that they also support CUDA as a platform. I had the impression it was Intel CPU/NPU only.

I will give it a try at the weekend. Is the OPEA quant text-only, or does it also support image-text-to-text?
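
For reference, the basic flow from their README looks roughly like this, and I'd guess OPEA's 3.1 quant was produced along these lines (a sketch; the export format string and whether plain AutoModelForCausalLM is enough for the multimodal checkpoint are assumptions on my part):

```python
# Basic auto-round flow (text-LLM path) as shown in the Intel auto-round README.
# For the multimodal checkpoints the loading side probably needs auto-round's
# vision-language path instead of plain AutoModelForCausalLM.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

MODEL_ID = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"  # the model OPEA quantized

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# int4, group size 128, symmetric -- matching the "int4-AutoRound-awq-sym" name
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()

# export in an AWQ-compatible format so vLLM can serve it with its AWQ kernels
autoround.save_quantized("Mistral-Small-3.1-24B-int4-awq-sym", format="auto_awq")
```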

It recognizes images as well. It would be great if you could do this!

Well, it seems that at least the current version of auto-round is not yet ready for this Mistral version:

KeyError: <class 'transformers.models.mistral3.configuration_mistral3.Mistral3Config'>

I will have to take a deeper look into it and/or ask the OPEA team what they did for v3.1.

unsloth/Mistral-Small-3.2-24B-Instruct-2506 seems to load.

I've only seen bnb and GGUF quants from unsloth.
