Make this work directly with AutoModelForCausalLM.from_pretrained
Since 🤗transformers now has native support for GPTQ-quantized models,
quantized models can be loaded and used simply by calling AutoModelForCausalLM.from_pretrained('your_model').
TheBloke's GPTQ models already support this, but yours doesn't yet.
It would be nice to see this change, since the model could then be dropped into many existing scripts with little to no code alteration.
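For reference, this is roughly what loading would look like once it works (a minimal sketch: 'your_model' is a placeholder for this repo's id, and it assumes optimum, auto-gptq, and accelerate are installed, which the 🤗transformers GPTQ backend needs):

```python
# Minimal sketch: load the GPTQ model directly through transformers.
# "your_model" is a placeholder; requires optimum + auto-gptq (+ accelerate for device_map).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your_model"  # placeholder for this repo's id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Hei, hvordan går det?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0], skip_special_tokens=True))
```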
I have already done this on a private repo, so I'll let you know the steps I took to make it work:
- Rename the safetensors model file to model.safetensors
- The safetensors file lacks the metadata that the 🤗transformers backend relies on, so it needs to be added.
I used safetensors_util (https://github.com/by321/safetensors_util) to add the metadata.
I just added the equivalent of the metadata in TheBloke's Llama 2 variant, which is the following config:
{
  "__metadata__": {
    "format": "pt",
    "quantized_by": "RuterNorway"
  }
}
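In case a scripted approach is easier than the external tool, here is a minimal sketch that does the same rewrite with the safetensors Python API (the path is a placeholder, and it loads the full weights into memory before saving):

```python
# Minimal sketch: rewrite model.safetensors with the metadata block attached.
# Path is a placeholder; load_file pulls all tensors into RAM.
from safetensors.torch import load_file, save_file

path = "model.safetensors"  # the renamed GPTQ weights file
tensors = load_file(path)
save_file(tensors, path, metadata={"format": "pt", "quantized_by": "RuterNorway"})
```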
If you'd like, I could open a pull request, but I figured you might just do it yourself so you don't have to spend time verifying everything.
FYI: I have not tested whether this still works with the ExLlama notebooks and your example code, only that it works with AutoModelForCausalLM.from_pretrained.
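If you want to double-check the header before pushing, the metadata can be read back like this (minimal sketch, placeholder path):

```python
# Quick sanity check that the metadata was written into the safetensors header.
from safetensors import safe_open

with safe_open("model.safetensors", framework="pt") as f:
    print(f.metadata())  # expected: {'format': 'pt', 'quantized_by': 'RuterNorway'}
```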