Distributed inference
#23
by Lucaslym - opened
I want to use Qwen2.5-VL-3B-Instruct, but I only have 8 GPUs with 12GB of memory each. Currently the model can be loaded across the eight GPUs, but inference runs out of CUDA memory. I observed that during inference the memory on the eight GPUs is not fully utilized. How should I set this up?
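For reference, this is roughly the loading setup in question, as a minimal sketch assuming the transformers API from the Qwen2.5-VL model card with accelerate installed. The per-GPU `max_memory` cap is an assumption added here to reserve headroom for inference-time activations, not something from the original post:

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"

# Cap each 12GB card below its physical limit so activations and the
# KV cache have headroom during generation; the 8GiB figure is an
# assumption to tune for the actual workload.
max_memory = {i: "8GiB" for i in range(torch.cuda.device_count())}

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half-precision weights: ~2 bytes/param
    device_map="auto",           # shard layers across all visible GPUs
    max_memory=max_memory,
)
processor = AutoProcessor.from_pretrained(model_id)
```

The idea behind the cap: `device_map="auto"` only balances the *weights* across GPUs, so at inference time the activations and KV cache can push a single card over its limit even while the others sit partly idle. Lowering `max_memory` spreads the weights more conservatively and leaves room for those runtime buffers.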