Can you explain more about how to load this?


Why are you using --pipeline-parallel-size 2 and -tp 2 instead of just -tp 4, etc.? And is the dtype necessary for loading?

Hi, the last time I tested GLM-4.5, tensor parallelism with more than 2 GPUs did not work on vLLM, and yes, --dtype float16 was necessary for loading. I'm not sure now; vLLM may have been updated since then so that more than 2 GPUs work and --dtype float16 is no longer necessary.
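
For reference, a minimal sketch of the kind of launch command being discussed, assuming 4 GPUs and this repo (the port and context length are placeholders, not values from the thread):

CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve cpatonn/GLM-4.5-AWQ-4bit \
  --pipeline-parallel-size 2 -tp 2 \
  --dtype float16 \
  --max-model-len 40960 --port 8001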

OK, I'll download it again and try. I've been trying to load GLM-4.5 all week in 4-bit with 3 RTX PRO 6000s, using --pipeline-parallel-size 3 -tp 1. Depending on the vLLM version (0.10 vs 0.10.1, etc.) I've been able to load some of your quants and/or QuantTrio's, but then vLLM throws errors and crashes when it gets a query. Hopefully today it will work.

Thanks, by the way, for sharing your quants. I was curious: is this quant AWQ or GPTQ, or some other type of INT4? I ask because this one works for me: https://huggingface.co/Intel/Qwen3-Coder-480B-A35B-Instruct-int4-mixed-AutoRound
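
If it helps, one way to check a repo's quant method is to read the quantization_config block in its config.json. A sketch, assuming the repo exposes that block (most quantized repos do):

curl -s https://huggingface.co/cpatonn/GLM-4.5-AWQ-4bit/resolve/main/config.json | python3 -c "import json,sys; print(json.load(sys.stdin).get('quantization_config', {}).get('quant_method'))"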


Got it running on 3x RTX PRO 6000!

CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1,2 vllm serve cpatonn/GLM-4.5-AWQ-4bit \
  --gpu-memory-utilization 0.94 --max-model-len 40960 \
  --port 8001 --served-model-name GLM-4.5 \
  --pipeline-parallel-size 3 -tp 1 --dtype float16
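
A quick smoke test against vLLM's OpenAI-compatible endpoint, reusing the --port and --served-model-name from the command above (a sketch, not from the original thread):

curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "GLM-4.5", "messages": [{"role": "user", "content": "hello"}], "max_tokens": 32}'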

@Fernanda24 I’m happy to hear that :)
