Distributed inference

#23
by Lucaslym - opened

I want to use Qwen2.5-VL-3B-Instruct, but I only have 8 GPUs with 12 GB of memory each. Currently the model loads across all eight GPUs, but inference runs out of CUDA memory. I noticed that during inference the memory on the eight GPUs is not fully utilized. How should I set this up?
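
For context, this is roughly the kind of multi-GPU loading I mean (a minimal sketch, assuming `transformers` with `accelerate` installed; the 10GiB per-GPU `max_memory` cap is an illustrative value, not my exact code):

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Shard the model across all 8 GPUs, leaving headroom on each
# card for activations and the KV cache during generation.
# The 10GiB cap per device is an assumed value, not a tested one.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={i: "10GiB" for i in range(8)},
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
```

My understanding is that `device_map="auto"` only shards the weights, so activations at inference time still land on individual cards, which is why I expected capping `max_memory` below 12GiB to leave room for them.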

code: [screenshot]
CUDA memory usage: [screenshot]
error: [screenshot]
