Quantization
Quantization techniques represent data with less information while trying to lose as little accuracy as possible. This often means converting a data type so that the same information is represented with fewer bits. For example, if your model weights are stored as 32-bit floating points and they're quantized to 16-bit floating points, the model size is halved, which makes it easier to store and reduces memory usage. Lower precision can also speed up inference because calculations with fewer bits take less time.
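To make the size claim concrete, here is a minimal sketch that uses only Python's standard `struct` module. The `weights` list is a hypothetical stand-in for real model weights; it packs the same values as 32-bit and as 16-bit floats, then checks how much precision the round trip loses.

```python
import struct

# Hypothetical toy "weights"; real models hold millions or billions of values.
weights = [0.1234567, -1.5, 3.1415927, 0.0009765625]

# Pack the same values as 32-bit floats ("f") and as 16-bit floats ("e").
fp32_bytes = struct.pack(f"<{len(weights)}f", *weights)
fp16_bytes = struct.pack(f"<{len(weights)}e", *weights)

# The 16-bit copy takes half the storage of the 32-bit copy.
print(len(fp32_bytes), len(fp16_bytes))

# Unpack the 16-bit copy to measure the precision lost by the conversion.
restored = struct.unpack(f"<{len(weights)}e", fp16_bytes)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max round-trip error: {max_error:.6f}")
```

The byte count halves exactly, while the round-trip error stays small but nonzero. That is the trade-off quantization makes.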
Interested in adding a new quantization method to Transformers? Read the HfQuantizer guide to learn how!
If you are new to quantization, we recommend the beginner-friendly quantization courses made in collaboration with DeepLearning.AI.
When to use what?
The community has developed many quantization methods for various use cases. Transformers integrates many of them, and because each method has its own pros and cons, the right choice depends on your use case.
For example, some quantization methods require calibrating the model with a dataset for more accurate and “extreme” compression (down to 1-2 bit quantization), while other methods work out of the box with on-the-fly quantization.
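As an illustration of why on-the-fly methods need no calibration dataset, the sketch below implements a simple absmax int8 scheme in plain Python. It is a generic textbook technique chosen for illustration, not the algorithm of any particular library listed here, and the helper names `quantize_absmax_int8` and `dequantize` are hypothetical.

```python
def quantize_absmax_int8(values):
    """Quantize floats to the int8 range using a scale derived from the values themselves."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; no external calibration data is needed.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the int8 codes back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.4, -1.0, 0.25, 0.0]  # hypothetical toy weights
q, scale = quantize_absmax_int8(weights)
approx = dequantize(q, scale)

print(q)
print([round(a, 4) for a in approx])
```

Because the scale comes directly from the weights, the conversion can run while the model loads. Calibration-based methods instead use sample data to choose quantization parameters that keep accuracy at lower bit widths.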
Another parameter to consider is compatibility with your target device. Do you want to quantize on a CPU, GPU, or Apple silicon?
In short, supporting a wide range of quantization methods allows you to pick the best quantization method for your specific use case.
Use the table below to help you decide which quantization method to use.
Quantization Method | On-the-fly quantization | CPU | CUDA GPU | ROCm GPU | Metal (Apple Silicon) | Intel GPU | torch.compile() | Bits | PEFT Fine-Tuning | Serializable with 🤗 Transformers | 🤗 Transformers Support | Link to library |
---|---|---|---|---|---|---|---|---|---|---|---|---|
AQLM | 🔴 | 🟢 | 🟢 | 🔴 | 🔴 | 🔴 | 🟢 | 1/2 | 🟢 | 🟢 | 🟢 | https://github.com/Vahe1994/AQLM |
AWQ | 🔴 | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | ? | 4 | 🟢 | 🟢 | 🟢 | https://github.com/casper-hansen/AutoAWQ |
bitsandbytes | 🟢 | 🟡 1 | 🟢 | 🟡 1 | 🔴 2 | 🟡 1 | 🔴 1 | 4/8 | 🟢 | 🟢 | 🟢 | https://github.com/bitsandbytes-foundation/bitsandbytes |
compressed-tensors | 🔴 | 🟢 | 🟢 | 🟢 | 🔴 | 🔴 | 🔴 | 1/8 | 🟢 | 🟢 | 🟢 | https://github.com/neuralmagic/compressed-tensors |
EETQ | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | ? | 8 | 🟢 | 🟢 | 🟢 | https://github.com/NetEase-FuXi/EETQ |
GGUF / GGML (llama.cpp) | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 1/8 | 🔴 | See Notes | See Notes | https://github.com/ggerganov/llama.cpp |
GPTQModel | 🔴 | 🟢 3 | 🟢 | 🟢 | 🟢 | 🟢 4 | 🔴 | 2/3/4/8 | 🟢 | 🟢 | 🟢 | https://github.com/ModelCloud/GPTQModel |
AutoGPTQ | 🔴 | 🔴 | 🟢 | 🟢 | 🔴 | 🔴 | 🔴 | 2/3/4/8 | 🟢 | 🟢 | 🟢 | https://github.com/AutoGPTQ/AutoGPTQ |
HIGGS | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🟢 | 2/4 | 🔴 | 🟢 | 🟢 | https://github.com/HanGuo97/flute |
HQQ | 🟢 | 🟢 | 🟢 | 🔴 | 🔴 | 🔴 | 🟢 | 1/8 | 🟢 | 🔴 | 🟢 | https://github.com/mobiusml/hqq/ |
optimum-quanto | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | 🔴 | 🟢 | 2/4/8 | 🔴 | 🔴 | 🟢 | https://github.com/huggingface/optimum-quanto |
FBGEMM_FP8 | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🔴 | 8 | 🔴 | 🟢 | 🟢 | https://github.com/pytorch/FBGEMM |
torchao | 🟢 | | 🟢 | 🔴 | 🟡 5 | 🔴 | | 4/8 | | 🟢🔴 | 🟢 | https://github.com/pytorch/ao |
VPTQ | 🔴 | 🔴 | 🟢 | 🟡 | 🔴 | 🔴 | 🟢 | 1/8 | 🔴 | 🟢 | 🟢 | https://github.com/microsoft/VPTQ |
2: bitsandbytes is seeking contributors to help develop and lead the Apple Silicon backend. Interested? Contact them directly via their repo. Stipends may be available through sponsorships.
3: GPTQModel[CPU] supports 4-bit via IPEX on Intel/AMD and full bit range via Torch on Intel/AMD/Apple Silicon.
4: GPTQModel[Intel GPU] via IPEX only supports 4-bit for Intel Datacenter Max/Arc GPUs.
5: torchao only supports int4 weight on Metal (Apple Silicon).
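One way to narrow the choices is to encode the rows you care about as data and filter on your requirements. The sketch below copies a handful of rows from the table above, using only cells marked as fully supported or unsupported. The `METHODS` dictionary and `pick_methods` helper are hypothetical and not part of Transformers.

```python
# A few rows copied from the table above (True = supported, False = unsupported).
METHODS = {
    "AQLM":           {"on_the_fly": False, "cpu": True,  "cuda": True, "metal": False},
    "AWQ":            {"on_the_fly": False, "cpu": True,  "cuda": True, "metal": False},
    "EETQ":           {"on_the_fly": True,  "cpu": False, "cuda": True, "metal": False},
    "HQQ":            {"on_the_fly": True,  "cpu": True,  "cuda": True, "metal": False},
    "optimum-quanto": {"on_the_fly": True,  "cpu": True,  "cuda": True, "metal": True},
}


def pick_methods(device, on_the_fly=None):
    """Return methods that support `device`, optionally filtered by on-the-fly support."""
    return [
        name
        for name, row in METHODS.items()
        if row[device] and (on_the_fly is None or row["on_the_fly"] == on_the_fly)
    ]


print(pick_methods("cpu", on_the_fly=True))
print(pick_methods("metal"))
print(pick_methods("cuda", on_the_fly=False))
```

The same pattern extends to the other columns, such as bit width or PEFT fine-tuning support, if those matter for your use case.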