Optimum documentation

Optimization

You are viewing the main version, which requires installation from source. For a regular pip install, check out the latest stable version (v1.23.3).

🤗 Optimum provides an optimum.onnxruntime package that lets you apply graph optimization to many models hosted on the 🤗 hub using the ONNX Runtime model optimization tool.

Optimizing a model during the ONNX export

The ONNX model can be optimized directly during the ONNX export with the Optimum CLI by passing the --optimize {O1,O2,O3,O4} argument, for example:

optimum-cli export onnx --model gpt2 --optimize O3 gpt2_onnx/

The optimization levels are:

  • O1: basic general optimizations.
  • O2: basic and extended general optimizations, transformers-specific fusions.
  • O3: same as O2 with GELU approximation.
  • O4: same as O3 with mixed precision (fp16, GPU-only, requires --device cuda).
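As a rough illustration, the export command above can also be assembled from Python. The helper names `build_export_command` and `run_export` are hypothetical and not part of Optimum; only the `optimum-cli export onnx` arguments shown on this page (`--model`, `--optimize`, `--device cuda`) are used.

```python
import subprocess

VALID_LEVELS = ("O1", "O2", "O3", "O4")


def build_export_command(model_id, output_dir, level):
    """Return the argv list for `optimum-cli export onnx` with --optimize."""
    if level not in VALID_LEVELS:
        raise ValueError(f"level must be one of {VALID_LEVELS}, got {level!r}")
    command = ["optimum-cli", "export", "onnx", "--model", model_id, "--optimize", level]
    if level == "O4":
        # O4 adds fp16 mixed precision and is GPU-only, so the export must target CUDA.
        command += ["--device", "cuda"]
    command.append(output_dir)
    return command


def run_export(model_id, output_dir, level):
    """Run the export; assumes optimum-cli is installed and on PATH."""
    subprocess.run(build_export_command(model_id, output_dir, level), check=True)
```

Calling `build_export_command("gpt2", "gpt2_onnx/", "O3")` yields the same arguments as the command line shown earlier.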

Optimizing a model programmatically with ORTOptimizer

ONNX models can be optimized with the ORTOptimizer. The class can be initialized using the from_pretrained() method, which supports different checkpoint formats.

  1. Using an already initialized ORTModel class.
>>> from optimum.onnxruntime import ORTOptimizer, ORTModelForSequenceClassification

>>> # Loading ONNX Model from the Hub
>>> model = ORTModelForSequenceClassification.from_pretrained(
...     "optimum/distilbert-base-uncased-finetuned-sst-2-english"
... )

>>> # Create an optimizer from an ORTModelForXXX
>>> optimizer = ORTOptimizer.from_pretrained(model)
  2. Using a local ONNX model from a directory.
>>> from optimum.onnxruntime import ORTOptimizer

>>> # This assumes a model.onnx exists in path/to/model
>>> optimizer = ORTOptimizer.from_pretrained("path/to/model")
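Because the local-directory form assumes a model.onnx file is present, it can help to check for it before creating the optimizer. The sketch below is one possible way to do that; `has_default_onnx_file` and `load_optimizer` are hypothetical helper names, and only `ORTOptimizer.from_pretrained` comes from Optimum.

```python
from pathlib import Path


def has_default_onnx_file(model_dir, file_name="model.onnx"):
    """Return True if model_dir contains the expected ONNX file."""
    return (Path(model_dir) / file_name).is_file()


def load_optimizer(model_dir):
    """Create an ORTOptimizer after checking that model.onnx exists."""
    if not has_default_onnx_file(model_dir):
        raise FileNotFoundError(f"No model.onnx found in {model_dir}")
    # Imported lazily so the directory check works without optimum installed.
    from optimum.onnxruntime import ORTOptimizer

    return ORTOptimizer.from_pretrained(model_dir)
```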

Optimization Configuration

The OptimizationConfig class lets you specify how the ORTOptimizer should perform the optimization.

In the optimization configuration, there are 4 possible optimization levels:

  • optimization_level=0: to disable all optimizations
  • optimization_level=1: to enable basic optimizations such as constant folding or redundant node eliminations
  • optimization_level=2: to enable extended graph optimizations such as node fusions
  • optimization_level=99: to enable data layout optimizations

Choosing a level enables the optimizations of that level, as well as the optimizations of all preceding levels. More information is available in the ONNX Runtime graph optimizations documentation.
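The cumulative rule can be pictured with a small lookup table. This is purely an illustration of the rule stated above, with descriptions copied from the list; the `LEVELS` table and `enabled_optimizations` function are hypothetical and not part of Optimum or ONNX Runtime.

```python
# Descriptions taken from the list above; illustrative only.
LEVELS = {
    0: "no optimization",
    1: "basic optimizations (constant folding, redundant node elimination)",
    2: "extended graph optimizations (node fusions)",
    99: "data layout optimizations",
}


def enabled_optimizations(optimization_level):
    """List what a level enables, including everything from lower levels."""
    if optimization_level not in LEVELS:
        raise ValueError(f"optimization_level must be one of {sorted(LEVELS)}")
    return [
        description
        for level, description in sorted(LEVELS.items())
        if 0 < level <= optimization_level
    ]
```

For instance, `enabled_optimizations(2)` returns both the basic and the extended entries, while `enabled_optimizations(0)` returns an empty list.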

enable_transformers_specific_optimizations=True means that transformers-specific graph fusion and approximation are performed in addition to the ONNX Runtime optimizations described above. Here is a list of the possible optimizations you can enable:

  • Gelu fusion with disable_gelu_fusion=False,
  • Layer Normalization fusion with disable_layer_norm_fusion=False,
  • Attention fusion with disable_attention_fusion=False,
  • SkipLayerNormalization fusion with disable_skip_layer_norm_fusion=False,
  • Add Bias and SkipLayerNormalization fusion with disable_bias_skip_layer_norm_fusion=False,
  • Add Bias and Gelu / FastGelu fusion with disable_bias_gelu_fusion=False,
  • Gelu approximation with enable_gelu_approximation=True.
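A minimal sketch of how these flags might be passed together is shown below. The flag names are the ones listed above; the helpers `build_config_kwargs` and `make_optimization_config` are hypothetical, and the choice of optimization_level=2 is only an example.

```python
# Flag values from the list above: every fusion left enabled, plus Gelu approximation.
FUSION_OPTIONS = {
    "disable_gelu_fusion": False,
    "disable_layer_norm_fusion": False,
    "disable_attention_fusion": False,
    "disable_skip_layer_norm_fusion": False,
    "disable_bias_skip_layer_norm_fusion": False,
    "disable_bias_gelu_fusion": False,
    "enable_gelu_approximation": True,
}


def build_config_kwargs(**overrides):
    """Merge the base settings, the fusion flags, and any caller overrides."""
    kwargs = {
        "optimization_level": 2,
        "enable_transformers_specific_optimizations": True,
        **FUSION_OPTIONS,
    }
    kwargs.update(overrides)
    return kwargs


def make_optimization_config(**overrides):
    """Build an OptimizationConfig; requires optimum to be installed."""
    from optimum.onnxruntime import OptimizationConfig

    return OptimizationConfig(**build_config_kwargs(**overrides))
```

For example, `make_optimization_config(disable_attention_fusion=True)` would keep every other fusion enabled while skipping attention fusion.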

Attention fusion is designed for right-side padding for BERT-like architectures (e.g. BERT, RoBERTa, ViT) and for left-side padding for generative models (GPT-like). If you do not follow this convention, set use_raw_attention_mask=True to avoid potential accuracy issues, at the cost of some performance.
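The padding convention can be encoded as a small check that decides whether use_raw_attention_mask should be set. This is only a sketch: the family labels "bert-like" and "gpt-like" and the helper `attention_mask_options` are hypothetical names chosen for illustration.

```python
# Padding side that attention fusion is designed for, per model family.
EXPECTED_PADDING_SIDE = {"bert-like": "right", "gpt-like": "left"}


def attention_mask_options(model_family, padding_side):
    """Return extra config kwargs when the padding convention is not followed."""
    expected = EXPECTED_PADDING_SIDE[model_family]
    if padding_side == expected:
        # Convention followed: no extra option is needed.
        return {}
    # Convention not followed: trade some performance for accuracy.
    return {"use_raw_attention_mask": True}
```

The returned dictionary could then be unpacked into the configuration, for example `OptimizationConfig(optimization_level=2, **attention_mask_options("gpt-like", "right"))`.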

While OptimizationConfig gives you full control over how the optimization is performed, it can be hard to know what to enable or disable. Instead, you can use AutoOptimizationConfig, which provides four common optimization levels:

  • O1: basic general optimizations.
  • O2: basic and extended general optimizations, transformers-specific fusions.
  • O3: same as O2 with GELU approximation.
  • O4: same as O3 with mixed precision (fp16, GPU-only).

Example: Loading an O2 OptimizationConfig

>>> from optimum.onnxruntime import AutoOptimizationConfig
>>> optimization_config = AutoOptimizationConfig.O2()

You can also specify custom arguments that were not defined in the O2 configuration, for instance:

>>> from optimum.onnxruntime import AutoOptimizationConfig
>>> optimization_config = AutoOptimizationConfig.O2(disable_embed_layer_norm_fusion=False)
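When the level is only known at runtime (for example, read from a settings file), the matching constructor can be looked up by name. The sketch below assumes that O1, O3 and O4 are exposed the same way as the O2() call shown above; `auto_config_for` is a hypothetical helper.

```python
AUTO_LEVELS = ("O1", "O2", "O3", "O4")


def auto_config_for(level, **kwargs):
    """Return AutoOptimizationConfig.<level>(**kwargs) for a level given as a string."""
    if level not in AUTO_LEVELS:
        raise ValueError(f"level must be one of {AUTO_LEVELS}, got {level!r}")
    # Imported lazily so the validation above works without optimum installed.
    from optimum.onnxruntime import AutoOptimizationConfig

    # Assumption: each level is a constructor named after it, like O2() above.
    return getattr(AutoOptimizationConfig, level)(**kwargs)
```

With this helper, `auto_config_for("O2", disable_embed_layer_norm_fusion=False)` would be equivalent to the call shown above.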

Optimization examples

Below is a simple end-to-end example of how to optimize distilbert-base-uncased-finetuned-sst-2-english.

>>> from optimum.onnxruntime import (
...     AutoOptimizationConfig, ORTOptimizer, ORTModelForSequenceClassification
... )

>>> model_id = "distilbert-base-uncased-finetuned-sst-2-english"
>>> save_dir = "distilbert_optimized"

>>> # Load a PyTorch model and export it to the ONNX format
>>> model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)

>>> # Create the optimizer
>>> optimizer = ORTOptimizer.from_pretrained(model)

>>> # Define the optimization strategy by creating the appropriate configuration
>>> optimization_config = AutoOptimizationConfig.O2()

>>> # Optimize the model
>>> optimizer.optimize(save_dir=save_dir, optimization_config=optimization_config)

Below is a simple end-to-end example of how to optimize a Seq2Seq model, sshleifer/distilbart-cnn-12-6, and then run generation with the optimized model.

>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import OptimizationConfig, ORTOptimizer, ORTModelForSeq2SeqLM

>>> model_id = "sshleifer/distilbart-cnn-12-6"
>>> save_dir = "distilbart_optimized"

>>> # Load a PyTorch model and export it to the ONNX format
>>> model = ORTModelForSeq2SeqLM.from_pretrained(model_id, export=True)

>>> # Create the optimizer
>>> optimizer = ORTOptimizer.from_pretrained(model)

>>> # Define the optimization strategy by creating the appropriate configuration
>>> optimization_config = OptimizationConfig(
...     optimization_level=2,
...     enable_transformers_specific_optimizations=True,
...     optimize_for_gpu=False,
... )

>>> # Optimize the model
>>> optimizer.optimize(save_dir=save_dir, optimization_config=optimization_config)
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> optimized_model = ORTModelForSeq2SeqLM.from_pretrained(save_dir)
>>> tokens = tokenizer("This is a sample input", return_tensors="pt")
>>> outputs = optimized_model.generate(**tokens)

Optimizing a model with Optimum CLI

The Optimum ONNX Runtime optimization tools can be used directly through the Optimum command-line interface:

optimum-cli onnxruntime optimize --help
usage: optimum-cli <command> [<args>] onnxruntime optimize [-h] --onnx_model ONNX_MODEL -o OUTPUT (-O1 | -O2 | -O3 | -O4 | -c CONFIG)

options:
  -h, --help            show this help message and exit
  -O1                   Basic general optimizations (see: https://huggingface.co/docs/optimum/onnxruntime/usage_guides/optimization for more details).
  -O2                   Basic and extended general optimizations, transformers-specific fusions (see: https://huggingface.co/docs/optimum/onnxruntime/usage_guides/optimization for more
                        details).
  -O3                   Same as O2 with Gelu approximation (see: https://huggingface.co/docs/optimum/onnxruntime/usage_guides/optimization for more details).
  -O4                   Same as O3 with mixed precision (see: https://huggingface.co/docs/optimum/onnxruntime/usage_guides/optimization for more details).
  -c CONFIG, --config CONFIG
                        `ORTConfig` file to use to optimize the model.

Required arguments:
  --onnx_model ONNX_MODEL
                        Path to the repository where the ONNX models to optimize are located.
  -o OUTPUT, --output OUTPUT
                        Path to the directory where to store generated ONNX model.

Optimizing an ONNX model can be done as follows:

optimum-cli onnxruntime optimize --onnx_model onnx_model_location/ -O1 -o optimized_model/

This optimizes all the ONNX files in onnx_model_location with the basic general optimizations.
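The usage line above requires exactly one of -O1, -O2, -O3, -O4 or -c CONFIG. A script that drives the CLI could enforce that before launching the command, as in the following sketch; `build_optimize_command` is a hypothetical helper, and only the flags shown in the help output are used.

```python
import subprocess


def build_optimize_command(onnx_model_dir, output_dir, level=None, config_path=None):
    """Return the argv list for `optimum-cli onnxruntime optimize`."""
    # The CLI accepts exactly one of a level flag or an ORTConfig file.
    if (level is None) == (config_path is None):
        raise ValueError("pass exactly one of level or config_path")
    command = ["optimum-cli", "onnxruntime", "optimize", "--onnx_model", onnx_model_dir]
    if level is not None:
        if level not in ("O1", "O2", "O3", "O4"):
            raise ValueError(f"unknown optimization level: {level!r}")
        command.append(f"-{level}")
    else:
        command += ["-c", config_path]
    command += ["-o", output_dir]
    return command


if __name__ == "__main__":
    # Same invocation as the example above; assumes optimum-cli is on PATH.
    argv = build_optimize_command("onnx_model_location/", "optimized_model/", level="O1")
    subprocess.run(argv, check=True)
```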
