Can you explain more about how to load it?
Why are you using --pipeline-parallel-size 2 and -tp 2 instead of just -tp 4, etc.? And is the --dtype flag necessary for loading?
Hi, the last time I tested GLM-4.5, tensor parallelism with more than 2 GPUs did not work on vLLM, and yes, --dtype float16 was necessary for loading. I'm not sure anymore, as vLLM may have been updated since then so that more than 2 GPUs can be used and --dtype float16 is no longer necessary.
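For anyone following along, a launch in that spirit on 4 GPUs would look something like the sketch below. The flags are standard vLLM options; the repo name is the one used later in this thread, and the memory/context settings are placeholders to adjust for your hardware:

```bash
# Sketch: 2 pipeline stages x 2-way tensor parallelism across 4 GPUs
vllm serve cpatonn/GLM-4.5-AWQ-4bit \
  --pipeline-parallel-size 2 \
  --tensor-parallel-size 2 \
  --dtype float16 \
  --max-model-len 40960 \
  --gpu-memory-utilization 0.94
```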
OK, I'll download it again and try. I've been trying to load GLM-4.5 all week in 4-bit with 3 RTX Pro 6000s, using --pipeline-parallel-size 3 -tp 1. Depending on the vLLM version (0.10 vs 0.10.1, etc.), I've been able to load some of your quants and/or QuantTrio's, but then vLLM throws errors and crashes when it gets a query. Hopefully today it will work.
Thanks, by the way, for sharing your quants. I was curious: is this quant AWQ, GPTQ, or some other type of INT4, like Intel's AutoRound? Because this one works for me: https://huggingface.co/Intel/Qwen3-Coder-480B-A35B-Instruct-int4-mixed-AutoRound
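One way to check is to read the quantization_config block in the model's config.json, where the quantization method is usually recorded. A minimal sketch, assuming jq is installed (AWQ quants typically report "awq" here):

```bash
# Fetch the model config and print the recorded quantization method
curl -s https://huggingface.co/cpatonn/GLM-4.5-AWQ-4bit/raw/main/config.json \
  | jq '.quantization_config.quant_method'
```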
Got it running on 3x RTX Pro 6000:

CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1,2 vllm serve cpatonn/GLM-4.5-AWQ-4bit --gpu-memory-utilization 0.94 --max-model-len 40960 --port 8001 --served-model-name GLM-4.5 --pipeline-parallel-size 3 -tp 1 --dtype float16
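As a quick sanity check that queries no longer crash the server, you can hit vLLM's OpenAI-compatible endpoint, using the port and served model name from the command above:

```bash
# Minimal test request against the running server
curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "GLM-4.5",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32
      }'
```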