Model loading taking too much GPU memory

#2
by tehreemfarooqi - opened

Hey, when I try to load the model using the exact code given in the repo card, it keeps giving me a CUDA out-of-memory error. I am using an NVIDIA V100 with 16 GB of VRAM. Given that I have run LLMs with more parameters, as well as speech-to-text models, on this GPU, this doesn't make sense to me. Am I doing something wrong?

SILMA AI - Arabic Language Models org

Hello Tehreem, and thanks for trying the model!

Our model will run on 16 GB GPUs only in quantized mode; you can find the sample code here:
https://huggingface.co/silma-ai/SILMA-9B-Instruct-v1.0#quantized-versions-through-bitsandbytes

You can also find our recommended GPU requirements here:
https://huggingface.co/silma-ai/SILMA-9B-Instruct-v1.0#gpu-requirements

Finally, here is the likely technical explanation of why you got the OOM error:

  • Our model has 9B parameters, each stored in BF16/FP16 (16-bit floating point).
  • At 2 bytes (16 bits) per parameter, 9 billion parameters take 18 billion bytes.
  • To convert that to GB of memory, divide 18 billion bytes by 1,073,741,824 (since 1 GB = 1,073,741,824 bytes).
  • Therefore, you need about 16.76 GB of GPU memory just to load the weights (see the quick check below).
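
As a quick sanity check, you can reproduce the arithmetic above in a few lines of Python (the only model-specific input is the parameter count):

```python
params = 9_000_000_000        # 9B parameters
bytes_per_param = 2           # BF16/FP16 = 16 bits = 2 bytes

total_bytes = params * bytes_per_param   # 18,000,000,000 bytes
gib = total_bytes / 1_073_741_824        # 1 GB = 2**30 bytes

print(f"Weights alone: {gib:.2f} GB")    # -> 16.76 GB, more than a 16 GB V100
```

And that is before counting activations and the KV cache, which only push usage higher.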

Thanks for your reply @karimouda! I was able to run it using a multi-GPU setup.
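
(For later readers: a multi-GPU load of the unquantized model can look like the sketch below, assuming the standard transformers/accelerate device_map flow; this is an illustration, not necessarily the exact setup used here.)

```python
import torch
from transformers import AutoModelForCausalLM

# device_map="auto" lets accelerate shard the BF16 weights across all
# visible GPUs, so no single card has to hold the full 16.76 GB.
model = AutoModelForCausalLM.from_pretrained(
    "silma-ai/SILMA-9B-Instruct-v1.0",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```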

tehreemfarooqi changed discussion status to closed
