Quantization

Quantization techniques focus on representing data with less information while also trying not to lose too much accuracy. This often means converting a data type to represent the same information with fewer bits. For example, if your model weights are stored as 32-bit floating points and they’re quantized to 16-bit floating points, this halves the model size, which makes it easier to store and reduces memory usage. Lower precision can also speed up inference because it takes less time to perform calculations with fewer bits.
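
For a rough sense of scale, the sketch below (an illustration only, assuming a model of roughly 12B parameters, about the size of the FLUX.1-dev transformer) estimates weight memory at different precisions:

# Back-of-the-envelope weight memory for a ~12B-parameter model
num_params = 12e9
for bits in (32, 16, 8, 4):
    print(f"{bits}-bit: {num_params * bits / 8 / 1024**3:.1f} GB")
# 32-bit: ~44.7 GB, 16-bit: ~22.4 GB, 8-bit: ~11.2 GB, 4-bit: ~5.6 GB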

Interested in adding a new quantization method to Diffusers? Refer to the Contribute new quantization method guide to learn more.

If you are new to quantization, we recommend checking out the beginner-friendly quantization courses created in collaboration with DeepLearning.AI.

When to use what?

Diffusers currently supports the following quantization methods: bitsandbytes, torchao, GGUF, and Quanto.

This resource provides a good overview of the pros and cons of different quantization techniques.

Pipeline-level quantization

Diffusers allows you to directly initialize pipelines from checkpoints that already contain quantized models. However, you may want to apply quantization on-the-fly when initializing a pipeline from a pre-trained, non-quantized checkpoint. You can do this with PipelineQuantizationConfig.

Start by defining a PipelineQuantizationConfig:

import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers.quantization_config import QuantoConfig
from diffusers.quantizers import PipelineQuantizationConfig
from transformers import BitsAndBytesConfig

pipeline_quant_config = PipelineQuantizationConfig(
    quant_mapping={
        # the transformer is a diffusers model, so it takes a diffusers config class
        "transformer": QuantoConfig(weights_dtype="int8"),
        # text_encoder_2 (T5) comes from transformers, so it takes the transformers config class
        "text_encoder_2": BitsAndBytesConfig(
            load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16
        ),
    }
)

Then pass it to from_pretrained() and run inference:

pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("photo of a cute dog").images[0]

This method allows for more granular control over the quantization of each model-level component in a pipeline, and it lets you mix quantization backends across components. In the example above, you used a combination of Quanto and bitsandbytes. One caveat, however, is that you need to know which components come from transformers in order to import the right quantization config class.
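
If you're unsure where a component comes from, one way to check (a sketch that reads the pipeline's model_index.json, which maps each component to its source library) is:

from diffusers import DiffusionPipeline

# model_index.json maps each component to (library, class), e.g.
# "transformer": ["diffusers", "FluxTransformer2DModel"] and
# "text_encoder_2": ["transformers", "T5EncoderModel"]
config = DiffusionPipeline.load_config("black-forest-labs/FLUX.1-dev")
for name, value in config.items():
    if isinstance(value, (list, tuple)) and len(value) == 2:
        library, class_name = value
        print(f"{name}: {library}.{class_name}")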

The other method is simpler to use but less flexible. Start by defining a PipelineQuantizationConfig, this time with a single backend instead of a per-component mapping:

pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder_2"],
)

This pipeline_quant_config can now be passed to from_pretrained() in the same way as in the example above.
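
For completeness, here is the same loading call as before (reusing the FLUX.1-dev checkpoint from the earlier example) with the simpler config:

pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("photo of a cute dog").images[0]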

In this case, quant_kwargs initializes the quantization configuration class selected by quant_backend, and components_to_quantize specifies which components to quantize.

The config below will work for most diffusion pipelines that have a transformer component. In most cases, you will want to quantize the transformer because it is often the most compute- and memory-intensive part of a diffusion pipeline.

pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer"],
)

Below is a list of the supported quantization backends available in both diffusers and transformers:

  • bitsandbytes_4bit
  • bitsandbytes_8bit
  • gguf
  • quanto
  • torchao
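
To swap backends in this simpler API, you only change quant_backend and quant_kwargs. For example, the sketch below (assuming torchao is installed; the kwargs map onto diffusers’ TorchAoConfig) applies int8 weight-only quantization to the transformer:

pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="torchao",
    quant_kwargs={"quant_type": "int8wo"},  # int8 weight-only quantization
    components_to_quantize=["transformer"],
)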

Diffusion pipelines can have multiple text encoders; FluxPipeline has two, for example. It’s recommended to quantize the memory-intensive text encoders, such as T5, Llama, and Gemma. In the example above, you quantized the T5 model of FluxPipeline through text_encoder_2 while keeping the CLIP model (accessible through text_encoder) intact.
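
To sanity-check the savings, a rough sketch (assuming the pipe object from the examples above; it only counts parameter storage and ignores buffers and activations) is to compare per-component parameter memory:

def param_size_gb(module):
    # sums parameter storage as currently held in memory (quantized or not)
    return sum(p.numel() * p.element_size() for p in module.parameters()) / 1024**3

for name in ["transformer", "text_encoder", "text_encoder_2", "vae"]:
    component = getattr(pipe, name, None)
    if component is not None:
        print(f"{name}: {param_size_gb(component):.2f} GB")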
