Model for 8 GPUs

#6
by ilwoonam75 - opened

Hi.
Thank you very much for your model.
I am currently using this model for vibe coding with Claude Code, following this post (https://www.reddit.com/r/LocalLLaMA/comments/1mfzzt4/experience_with_glm45air_claude_code/).
Everything works well, but I have 8 GPUs (8x RTX 4090) and I want to increase tensor-parallel-size and pipeline-parallel-size to use all of them.
Can you advise me on how to increase the parallel sizes?

Hi @ilwoonam75 , thank you for trying my model. I am happy that it works well on your machine.

As the CompressedTensorsWNA16MarlinMoEMethod implemented in vLLM only supports a minimum group_size of 32, the maximum tensor_parallel_size for this model is 2. In your case with 8 GPUs, please use --tensor-parallel-size 2 --pipeline-parallel-size 4, assuming there is only one node in your cluster, i.e., --data-parallel-size 1.
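
For example, a single-node launch on your 8 GPUs could look roughly like this (a sketch; the model path and port are placeholders for your setup):

```bash
# Sketch of a single-node launch: TP=2 within each pipeline stage,
# PP=4 stages, so all 8 GPUs are used (2 * 4 = 8).
vllm serve <path-or-repo-of-this-model> \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 4 \
  --port 8000
```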

There might be changes to this model or to vLLM in the future that would allow higher tensor parallelism.

Please let me know if there are any errors at runtime.

Thank you very much for your answer.
I tried --pipeline-parallel-size 4 and the model launched successfully.
But when I connect from Claude Code Router, the answers from the model are not correct.
I mean that the answers are correct with --pipeline-parallel-size 2, but they are weird with the --pipeline-parallel-size 4 option.
The reason I want to use 8 GPUs is to increase max-model-len. With 4 GPUs, max-model-len is limited; I want to use the model with its full max-model-len.
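
For reference, the kind of launch I am aiming for looks roughly like this (a sketch, not my exact command; the model path, context length, and memory fraction are placeholders):

```bash
# Sketch: the same TP=2 / PP=4 layout as above, but with max-model-len set
# explicitly. The context length and memory fraction below are placeholders.
vllm serve <path-or-repo-of-this-model> \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 4 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.90
```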

It's nice that the model launched successfully.

Could you provide some specific examples of the model giving weird answers with --pipeline-parallel-size 4, along with the vLLM logs? Does the model produce incoherent outputs, or does it miss some special tokens?
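
One easy way to capture a comparable example is to send the same prompt directly to the OpenAI-compatible endpoint under both settings and paste the raw responses here (a sketch; the port and served model name are placeholders for your setup):

```bash
# Send an identical, deterministic request to the server started with
# --pipeline-parallel-size 2 and then with 4, and compare the raw outputs.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<served-model-name>",
        "messages": [{"role": "user", "content": "Write a Python function that reverses a string."}],
        "max_tokens": 200,
        "temperature": 0
      }'
```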
