(Solved) "Some parameters are on the meta device device because they were offloaded to the cpu."

#9
by cptjtejur - opened

It took me a while to figure this out over the last couple of days, but the solution is fairly simple. I'm sharing it here in case anyone else runs into the same problem and is looking for a solution.

I run inference on a single RTX 4090. Its 24 GB of VRAM should be plenty to hold both the model and the conversation context, yet my minimum working example always returned

Some parameters are on the meta device device because they were offloaded to the cpu.

with inference on a medium-complexity task taking about 20 to 30 minutes to complete.
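For anyone who wants to confirm that offloading is really what is happening, here is a rough sketch of how you can inspect where the weights ended up, assuming a pipeline created with device_map="auto"; model_id is a placeholder for whatever repo id you are loading, and the variable names are just illustrative:

import torch
import transformers

model_id = "your-model-id"  # placeholder, substitute the actual repo id

# Reproduces the problematic setup: no explicit torch_dtype, so the weights may
# not all fit on the GPU and accelerate offloads part of the model.
pipe = transformers.pipeline("text-generation", model=model_id, device_map="auto")

# accelerate records where each module was placed; entries mapped to "cpu" or
# "disk" were offloaded.
print(pipe.model.hf_device_map)

# Offloaded parameters show up on the meta device, which is exactly what the
# warning above is complaining about.
meta_params = [n for n, p in pipe.model.named_parameters() if p.device.type == "meta"]
print(f"{len(meta_params)} parameters on the meta device")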

I was initially surprised because I had never faced these problems with the original Llama-3-8B-Instruct model. Looking for differences in the code, I noticed that the Llama examples explicitly set the torch dtype during pipeline creation:

pipeline = transformers.pipeline("text-generation", model=model_id, model_kwargs={"torch_dtype": torch.bfloat16}, device_map="auto")

While I never had to do this for other models so far, e.g. Mistral-7B-Instruct-v0.3, it appears to be required here to avoid the offloading behavior. The same inference tasks now complete in a couple of seconds.
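For reference, here is a complete minimal sketch of the working setup; model_id and the prompt are placeholders, so substitute the actual repo id you are using. My understanding is that without an explicit torch_dtype the weights are loaded in float32, which roughly doubles the memory footprint compared to bfloat16 and no longer fits into 24 GB, so accelerate quietly offloads part of the model to the CPU:

import torch
import transformers

model_id = "your-model-id"  # placeholder, substitute the actual repo id

# Loading in bfloat16 halves the memory footprint, so the whole model fits on
# the GPU and nothing gets offloaded to the CPU.
pipe = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

output = pipe("Explain what device_map='auto' does.", max_new_tokens=128)
print(output[0]["generated_text"])

The only difference from the failing setup is the model_kwargs entry; everything else is unchanged.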
