Distributed inference
#23
by Lucaslym - opened
I want to use Qwen2.5-VL-3B-Instruct, but I only have 8 GPUs with 12GB of memory each. Currently the model can be loaded across the eight GPUs, but inference runs out of CUDA memory. I observed that during inference the memory on the eight GPUs is not fully utilized. How should I set this up?
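For reference, this is roughly the loading setup in question, as a minimal sketch assuming the transformers API from the Qwen2.5-VL model card with accelerate installed. The per-GPU `max_memory` cap is an assumption added here to reserve headroom for inference-time activations, not something from the original post:

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"

# Cap each 12GB card below its physical limit so activations and the
# KV cache have headroom during generation; the 8GiB figure is an
# assumption to tune for the actual workload.
max_memory = {i: "8GiB" for i in range(torch.cuda.device_count())}

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half-precision weights: ~2 bytes/param
    device_map="auto",           # shard layers across all visible GPUs
    max_memory=max_memory,
)
processor = AutoProcessor.from_pretrained(model_id)
```

The idea behind the cap: `device_map="auto"` only balances the *weights* across GPUs, so at inference time the activations and KV cache can push a single card over its limit even while the others sit partly idle. Lowering `max_memory` spreads the weights more conservatively and leaves room for those runtime buffers.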