Transformers documentation
HIGGS
HIGGS is a zero-shot quantization algorithm that combines Hadamard preprocessing with MSE-Optimal quantization grids to achieve lower quantization error and state-of-the-art performance.
Runtime support for HIGGS is implemented through the FLUTE library. Only the 70B and 405B variants of Llama 3.1 and Llama 3.0, and the 8B and 27B variants of Gemma 2, are currently supported. HIGGS also doesn't currently support quantized training or backward passes in general.
Run the command below to install FLUTE.
pip install flute-kernel
Create a HiggsConfig with the number of bits to quantize a model to.
from transformers import AutoModelForCausalLM, AutoTokenizer, HiggsConfig

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    quantization_config=HiggsConfig(bits=4),
    device_map="auto",
)
Find models pre-quantized with HIGGS in the official ISTA-DASLab collection.
torch.compile
HIGGS is fully compatible with torch.compile.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HiggsConfig

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    quantization_config=HiggsConfig(bits=4),
    device_map="auto",
)
model = torch.compile(model)
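A rough way to measure forward passes per second, as in the benchmark table below, is a warmed-up timing loop. This helper is a sketch, not the benchmark script used for the table; absolute numbers depend on hardware and settings:

```python
import time

import torch


def forward_passes_per_sec(model, input_ids, n_iters=50):
    """Time raw forward passes; warmup iterations trigger any compilation."""
    with torch.no_grad():
        for _ in range(5):  # warmup (torch.compile traces/compiles here)
            model(input_ids)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_iters):
            model(input_ids)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        return n_iters / (time.perf_counter() - start)
```

Call it with a batch of token ids on the model's device, e.g. `forward_passes_per_sec(model, inputs["input_ids"])`, varying the batch size to reproduce the rows of the table.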
Refer to the table below for a benchmark of forward passes per second for Llama-3.1-8B-Instruct on an RTX 4090.
| Batch size | BF16 (with torch.compile) | HIGGS 4-bit (without torch.compile) | HIGGS 4-bit (with torch.compile) |
|---|---|---|---|
| 1 | 59 | 41 | 124 |
| 4 | 57 | 42 | 123 |
| 16 | 56 | 41 | 120 |