How to utilize all GPUs when device_map="balanced_low_0" is set

#185
by kmukeshreddy - opened

While loading the model with device_map="balanced_low_0", the model is placed on all GPUs apart from GPU 0, which is left free for text inference (i.e., performing the computations inside the LLM that generate the response).

So, with this device_map setting, my model is loaded onto GPUs 1, 2, and 3, and GPU 0 is left for inference:

| GPU ID | Utilization | Memory used |
|--------|-------------|-------------|
| 0 | 0% | 3% |
| 1 | 0% | 83% |
| 2 | 0% | 82% |
| 3 | 0% | 76% |
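
For reference, a minimal sketch of how such a loading call might look (the checkpoint name is a placeholder, not the actual model from this post):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint -- substitute the model you are actually loading.
model_id = "bigscience/bloom-7b1"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="balanced_low_0",  # shard weights across GPUs 1..N, keep GPU 0 mostly free
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Inspect where each module ended up; GPU 0 should hold little or nothing,
# matching the memory column in the table above.
print(model.hf_device_map)
```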

Question: How can I also utilize the remaining GPUs 1, 2, and 3 to perform text inference, and not only GPU 0?

Context (from the Accelerate docs): "balanced_low_0" evenly splits the model on all GPUs except the first one, and only puts on GPU 0 what does not fit on the others. This option is great when you need to use GPU 0 for some processing of the outputs, like when using the generate function for Transformers models.

Reference: https://huggingface.co/docs/accelerate/en/concept_guides/big_model_inference#designing-a-device-map
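
To illustrate the docs' point with a minimal sketch (assuming the model and tokenizer from the snippet above): during generate, the layer computations run on GPUs 1-3, where the weights live, while GPU 0 stays free for processing the outputs.

```python
prompt = "Hello, my name is"
# model.device points at the device holding the first layers
# (GPU 1 in this layout, an assumption); Accelerate's dispatch hooks
# then move activations between GPUs as the forward pass proceeds.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```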

Hello @ybelkada, hope you are doing well!
Looking forward to your comment on the issue above.
Thanks in advance!
