Not working with inference!

#1
by alielfilali01 - opened

Hi @ybelkada , I didn't know who else to tag here. After quantizing this model, I tried to load it and it loaded successfully. I assumed that if it loads, it works, but it turns out it does not work at inference time.
Load script:

# !pip install 'git+https://github.com/huggingface/transformers.git'
# !pip install 'git+https://github.com/TimDettmers/bitsandbytes.git'
# !pip install 'git+https://github.com/huggingface/accelerate.git'
# !pip install 'git+https://github.com/huggingface/optimum.git'
# !pip install auto-gptq

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

device = "cuda" if torch.cuda.is_available() else "cpu"

model_path = "Ali-C137/jais-13b-chat-GPTQ"
quantization_config = GPTQConfig(bits=4, disable_exllama=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    quantization_config=quantization_config,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

Inference script:

text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt").to(device)  # move inputs to the same device as the model

out = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))

But I get this error back:

File /usr/local/lib/python3.10/dist-packages/auto_gptq/nn_modules/qlinear/qlinear_cuda_old.py:348, in QuantLinear.forward(self, x)
    345 else:
    346     raise NotImplementedError("Only 2,3,4,8 bits are supported.")
--> 348 weight = scales * (weight - zeros)
    349 weight = weight.reshape(weight.shape[0] * weight.shape[1], weight.shape[2])
    350 out = torch.matmul(x, weight)

RuntimeError: The size of tensor a (13653) must match the size of tensor b (13632) at non-singleton dimension 2

Can you please help with that?
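In case it helps narrow things down: 13632 is the nearest multiple of 32 below 13653, so the failing layer's input dimension may simply not be divisible by the 32-element packing the old CUDA kernel appears to assume. A quick, purely illustrative check (assuming auto-gptq's QuantLinear layers expose infeatures/outfeatures and the model from the load script above is still in memory):

print(13653 % 32)   # -> 21, i.e. not a multiple of 32
print(13632 % 32)   # -> 0, the truncated size from the error

# List quantized linear layers whose input dimension is not a multiple of 32
for name, module in model.named_modules():
    if hasattr(module, "infeatures") and module.infeatures % 32 != 0:
        print(name, module.infeatures, module.outfeatures)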

cc: @derek-thomas, have you tried to quantize this model before?

I was able to use 8-bit IIRC. Not sure if it works well with 4 bits.
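For reference, a minimal sketch of the 8-bit GPTQ route (not necessarily exactly what was run; the base checkpoint path and the "c4" calibration set below are placeholders, and bitsandbytes load_in_8bit would be another option entirely):

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# Placeholder: path to the original, unquantized jais-13b-chat checkpoint
base_model_path = "inception-mbzuai/jais-13b-chat"
tokenizer = AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True)

# Quantize to 8 bits instead of 4; "c4" is one of the built-in calibration datasets
quantization_config = GPTQConfig(bits=8, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    device_map="auto",
    quantization_config=quantization_config,
    trust_remote_code=True,
)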
