Can you explain more about how to load it?
Why are you using --pipeline-parallel-size 2 and -tp 2 instead of just -tp 4, etc.? And is the --dtype flag necessary for loading?
Hi, the last time I tested GLM-4.5, tensor parallelism with more than 2 GPUs did not work on vLLM, and yes, --dtype float16 was necessary for loading. I'm not sure anymore, as vLLM may have been updated since then so that more than 2 GPUs can be used and --dtype float16 is no longer necessary.
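For anyone following along, a launch in that spirit on 4 GPUs would look something like the sketch below. The flags are standard vLLM options; the repo name is the one used later in this thread, and the memory/context settings are placeholders to adjust for your hardware:

```bash
# Sketch: 2 pipeline stages x 2-way tensor parallelism across 4 GPUs
vllm serve cpatonn/GLM-4.5-AWQ-4bit \
  --pipeline-parallel-size 2 \
  --tensor-parallel-size 2 \
  --dtype float16 \
  --max-model-len 40960 \
  --gpu-memory-utilization 0.94
```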
OK, I'll download it again and try. I've been trying to load GLM-4.5 all week in 4-bit with 3 RTX Pro 6000s, using --pipeline-parallel-size 3 -tp 1. Depending on the vLLM version (0.10 vs 0.10.1, etc.), I've been able to load some of your quants and/or QuantTrio's, but then vLLM throws errors and crashes when it gets a query. Hopefully today it will work.
Thanks, by the way, for sharing your quants. I was curious: is this quant AWQ, GPTQ, or some other type of INT4, like Intel's AutoRound? Because this one works for me: https://huggingface.co/Intel/Qwen3-Coder-480B-A35B-Instruct-int4-mixed-AutoRound
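One way to check is to read the quantization_config block in the model's config.json, where the quantization method is usually recorded. A minimal sketch, assuming jq is installed (AWQ quants typically report "awq" here):

```bash
# Fetch the model config and print the recorded quantization method
curl -s https://huggingface.co/cpatonn/GLM-4.5-AWQ-4bit/raw/main/config.json \
  | jq '.quantization_config.quant_method'
```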
Got it running on 3x RTX Pro 6000:

CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1,2 vllm serve cpatonn/GLM-4.5-AWQ-4bit --gpu-memory-utilization 0.94 --max-model-len 40960 --port 8001 --served-model-name GLM-4.5 --pipeline-parallel-size 3 -tp 1 --dtype float16
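As a quick sanity check that queries no longer crash the server, you can hit vLLM's OpenAI-compatible endpoint, using the port and served model name from the command above:

```bash
# Minimal test request against the running server
curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "GLM-4.5",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32
      }'
```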