Not working with inference!
#1 opened by alielfilali01
Hi @ybelkada, I didn't know who else to tag here. After quantizing this model, I tried to load it and it loaded successfully, so I assumed that if it loads, it works. But it turns out it does not work at inference time!
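For context, quantizing with the standard transformers GPTQ flow looks roughly like this; the base repo id, calibration dataset, and group size below are illustrative assumptions, not necessarily what was used to produce Ali-C137/jais-13b-chat-GPTQ:

# Sketch of a typical transformers GPTQ quantization run (assumed settings, not the confirmed ones)
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

base_model_path = "<original-jais-13b-chat-repo>"  # placeholder for the non-quantized checkpoint

tokenizer = AutoTokenizer.from_pretrained(base_model_path)
# "c4" calibration data and group_size=128 are assumptions here, not confirmed settings
gptq_config = GPTQConfig(bits=4, dataset="c4", group_size=128, tokenizer=tokenizer)

quantized = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    device_map="auto",
    quantization_config=gptq_config,
    trust_remote_code=True,
)
quantized.push_to_hub("Ali-C137/jais-13b-chat-GPTQ")
tokenizer.push_to_hub("Ali-C137/jais-13b-chat-GPTQ")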
Load script:
# !pip install 'git+https://github.com/huggingface/transformers.git'
# !pip install 'git+https://github.com/TimDettmers/bitsandbytes.git'
# !pip install 'git+https://github.com/huggingface/accelerate.git'
# !pip install 'git+https://github.com/huggingface/optimum.git'
# !pip install auto-gptq

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

device = "cuda" if torch.cuda.is_available() else "cpu"  # not used below; device_map="auto" handles placement
model_path = "Ali-C137/jais-13b-chat-GPTQ"

# disable_exllama=True falls back to the auto-gptq CUDA kernels instead of the exllama ones
quantization_config = GPTQConfig(bits=4, disable_exllama=True)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    quantization_config=quantization_config,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
Inference script:
text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt").to(0)
out = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))
But I get this error back:
File /usr/local/lib/python3.10/dist-packages/auto_gptq/nn_modules/qlinear/qlinear_cuda_old.py:348, in QuantLinear.forward(self, x)
345 else:
346 raise NotImplementedError("Only 2,3,4,8 bits are supported.")
--> 348 weight = scales * (weight - zeros)
349 weight = weight.reshape(weight.shape[0] * weight.shape[1], weight.shape[2])
350 out = torch.matmul(x, weight)
RuntimeError: The size of tensor a (13653) must match the size of tensor b (13632) at non-singleton dimension 2
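One possible lead (a guess from the numbers in the error, not a confirmed diagnosis): 13653 looks like the Jais-13b FFN inner width, which is not a multiple of 32, while 13632 is exactly that width rounded down to a multiple of 32, i.e. what the packed auto-gptq kernels seem to expect:

# Arithmetic check on the two sizes from the error (an observation, not a confirmed root cause)
ffn_dim = 13653              # tensor a in the error; appears to be the Jais-13b FFN inner width
print(ffn_dim % 32)          # 21    -> not a multiple of the 32-value packing width
print(ffn_dim // 32 * 32)    # 13632 -> exactly tensor b in the error
print(ffn_dim % 128)         # 85    -> not a multiple of the default GPTQ group_size either

If that is indeed the cause, any layer with a 13653-wide dimension would hit the same mismatch no matter how the quantized checkpoint is loaded.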
Can you please help with that?
cc: @derek-thomas, have you tried to quantize the model before?
I was able to use 8-bit IIRC. Not sure if it works well with 4 bits.
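Something along these lines worked for the 8-bit load, if memory serves (a rough sketch; the base repo id is a placeholder for the original, non-GPTQ jais-13b-chat checkpoint, and I haven't re-verified the exact settings):

# Rough 8-bit loading sketch with bitsandbytes (assumed settings, placeholder repo id)
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_model_path = "<original-jais-13b-chat-repo>"  # placeholder, not the GPTQ repo

bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    device_map="auto",
    quantization_config=bnb_config,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(base_model_path)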