Problem while running on multiple GPUs
Hello, I used the following piece of code to test inference with the Mixtral model.
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "./Mixtral-8x7B-Instruct-v0.1/snapshots/1e637f2d7cb0a9d6fb1922f305cb784995190a83"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hello, nice to meet you. How can I help you?"},
    {"role": "user", "content": "Where is Paris located?"},
]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
The response I got from the model was not correct (not reproduced here).
If device_map is set to "cpu" and the .to("cuda") call is removed from the inputs, I get the right response.
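For reference, this is the CPU variant that gives the correct answer (a minimal sketch; the only changes to the snippet above are device_map="cpu" and dropping the .to("cuda") call):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "./Mixtral-8x7B-Instruct-v0.1/snapshots/1e637f2d7cb0a9d6fb1922f305cb784995190a83"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the whole model on the CPU instead of sharding it across the GPUs
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cpu")

messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hello, nice to meet you. How can I help you?"},
    {"role": "user", "content": "Where is Paris located?"},
]
# Keep the inputs on the CPU as well (no .to("cuda"))
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))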
So this is clearly an issue with the distribution of the model across the GPUs. I am using 4x 48 GB RTX 6000 Ada GPUs for inference. Could you please tell me what is happening here?
This issue does not occur if I use the Mistral 7B model, since it is small enough to fit on a single GPU; the problem only starts when the model is split across multiple GPUs.
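In case it helps with the diagnosis, here is a small sketch of what I can run to inspect the sharding (assuming the accelerate-backed device_map path, which records the placement in model.hf_device_map; model.device should be the device holding the first parameters):

# Show which GPU each submodule was assigned to by device_map="auto"
print(model.hf_device_map)

# Move the inputs to the device holding the first layers instead of
# hard-coding "cuda", in case that matters for the sharded setup
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))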
Any help is really appreciated. Thanks!