# VPTQ
Vector Post-Training Quantization (VPTQ) is a Post-Training Quantization (PTQ) method that leverages vector quantization to quantize LLMs at extremely low bit-widths (<2-bit). VPTQ can compress a 70B, or even a 405B, model to 1-2 bits without retraining while maintaining a high degree of accuracy. It is a lightweight quantization algorithm that takes ~17 hours to quantize a 405B model. VPTQ features agile quantization inference with low decoding overhead, high throughput, and a fast Time To First Token (TTFT).
Run the command below to install VPTQ, which provides efficient kernels for inference on NVIDIA and AMD GPUs.
```bash
pip install vptq
```
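To confirm the package installed correctly, a quick check (a minimal sanity test, not part of the official install steps) is:

```python
# Minimal sanity check: confirm the vptq package is importable and report its version.
import importlib.metadata

import vptq  # provides the inference kernels used at decode time

print("vptq version:", importlib.metadata.version("vptq"))
```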
The VPTQ-community provides a collection of VPTQ-quantized models. The model name encodes its equivalent bit-width (excluding codebook, parameter, and padding overhead). Consider the VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft model as an example.
- The model name is Meta-Llama-3.1-70B-Instruct.
- The number of centroids is given by 65536 (2^16).
- The number of residual centroids is given by 256 (2^8).
The equivalent bit-width is calculated as follows (the division by 8 comes from the vector length v8 in the model name, meaning each index covers 8 weights).
- index: log2(65536) = 16 bits / 8 = 2 bits per weight
- residual index: log2(256) = 8 bits / 8 = 1 bit per weight
- total bit-width: 2 + 1 = 3 bits per weight
From here, estimate the model size as 70B parameters × 3 bits / 8 bits per byte, for a total of roughly 26.25GB, as in the sketch below.
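The same arithmetic can be scripted. The helper below is a minimal sketch; the function name and hard-coded values (taken from the v8-k65536-256 model name above) are illustrative and not part of the Transformers API.

```python
import math

def equivalent_bits_per_weight(vector_length: int, num_centroids: int, num_res_centroids: int) -> float:
    """Equivalent bit-width per weight, ignoring codebook, parameter, and padding overhead."""
    index_bits = math.log2(num_centroids) / vector_length         # main index: 16 / 8 = 2 bits
    residual_bits = math.log2(num_res_centroids) / vector_length  # residual index: 8 / 8 = 1 bit
    return index_bits + residual_bits

# Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft -> v8, k65536, 256 residual centroids
bits = equivalent_bits_per_weight(vector_length=8, num_centroids=65536, num_res_centroids=256)
print(f"{bits} bits per weight")                     # 3.0
print(f"~{70e9 * bits / 8 / 1e9:.2f} GB estimated")  # ~26.25 GB
```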
Load a VPTQ quantized model with from_pretrained().
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

quantized_model = AutoModelForCausalLM.from_pretrained(
    "VPTQ-community/Meta-Llama-3.1-70B-Instruct-v16-k65536-65536-woft",
    torch_dtype="auto",
    device_map="auto"
)
```
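The quantized model can then be used like any other causal language model. Below is a minimal generation example building on the model loaded above; the prompt and generation settings are illustrative.

```python
tokenizer = AutoTokenizer.from_pretrained(
    "VPTQ-community/Meta-Llama-3.1-70B-Instruct-v16-k65536-65536-woft"
)

# Tokenize a prompt and generate with the quantized model.
inputs = tokenizer("Explain vector quantization in one sentence.", return_tensors="pt").to(quantized_model.device)
outputs = quantized_model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```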
To quantize your own model, refer to the VPTQ Quantization Algorithm Tutorial.
## Benchmarks
VPTQ achieves better accuracy and higher throughput with lower quantization overhead across models of different sizes. The following experimental results are for reference only; VPTQ can achieve better outcomes with properly tuned parameters, especially in terms of model accuracy and inference speed. W2 and C4 report perplexity (lower is better), AvgQA is the average accuracy across QA benchmarks (higher is better), tok/s is generation throughput, and mem is GPU memory usage.
| Model | Bit-width | W2↓ | C4↓ | AvgQA↑ | tok/s↑ | mem (GB) | cost/h↓ |
|---|---|---|---|---|---|---|---|
| LLaMA-2 7B | 2.02 | 6.13 | 8.07 | 58.2 | 39.9 | 2.28 | 2 |
| LLaMA-2 7B | 2.26 | 5.95 | 7.87 | 59.4 | 35.7 | 2.48 | 3.1 |
| LLaMA-2 13B | 2.02 | 5.32 | 7.15 | 62.4 | 26.9 | 4.03 | 3.2 |
| LLaMA-2 13B | 2.18 | 5.28 | 7.04 | 63.1 | 18.5 | 4.31 | 3.6 |
| LLaMA-2 70B | 2.07 | 3.93 | 5.72 | 68.6 | 9.7 | 19.54 | 19 |
| LLaMA-2 70B | 2.11 | 3.92 | 5.71 | 68.7 | 9.7 | 20.01 | 19 |
## Resources
See an example demo of VPTQ on the VPTQ Online Demo Space or try running the VPTQ inference notebook.
For more information, read the VPTQ paper.