# Faster Foundation Models with `torch.compile`

## Introduction to `torch.compile()`

This guide aims to provide a benchmark on the inference speed-ups introduced with `torch.compile()` with no reduction in model performance for foundation models in 🤗 Transformers.

Most used `torch.compile` modes are following:

- "default" is the default mode, which is a good balance between performance and overhead

- "reduce-overhead" reduces the overhead of python with CUDA graphs, useful for small batches, consumes a lot of memory. As of now only works for CUDA only graphs which do not mutate inputs.

If you have a lot of memory to use, the best speed-up is through `reduce-overhead`. How much speed-up one can get depends on the model, so in this tutorial we will check the most used foundation models.

## OWLv2

OWLv2 is a zero-shot object detection model released by Google Brain. We will load base version.

Let's load the model and processor for OWLv2.

In [1]:
from PIL import Image
import requests

url = 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg'
image = Image.open(requests.get(url, stream=True).raw)

In [2]:
from transformers import AutoProcessor, Owlv2ForObjectDetection
import torch
import numpy as np

processor = AutoProcessor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble").to("cuda")

texts = [["a photo of a bee", "a photo of a bird"]]
inputs = processor(text=texts, images=image, return_tensors="pt").to("cuda")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


We can now get to benchmarking. We will benchmark the model itself and the compiled model.

In [3]:
starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
repetitions = 30
timings=np.zeros((repetitions,1))

for _ in range(10):
 _ = model(**inputs)

with torch.no_grad():
 for rep in range(repetitions):
 torch.cuda.synchronize()
 starter.record()
 output = model(**inputs)
 ender.record()
 torch.cuda.synchronize()
 curr_time = starter.elapsed_time(ender)
 timings[rep] = curr_time

mean_syn = np.sum(timings) / repetitions
print(mean_syn)


255.7331792195638


In [4]:
starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
timings=np.zeros((repetitions,1))

compiled_model = torch.compile(model, mode="reduce-overhead").to("cuda")

for _ in range(30):
 with torch.no_grad():
 _ = compiled_model(**inputs)


with torch.no_grad():
 for rep in range(repetitions):
 torch.cuda.synchronize()
 starter.record()
 output = compiled_model(**inputs)
 ender.record()
 torch.cuda.synchronize()
 curr_time = starter.elapsed_time(ender)
 timings[rep] = curr_time

mean_syn = np.sum(timings) / repetitions
print(mean_syn)

 self.pid = os.fork()
skipping cudagraphs due to skipping cudagraphs due to cpu device. Found from : 
 File "/usr/local/lib/python3.10/dist-packages/transformers/models/owlv2/modeling_owlv2.py", line 1711, in forward
 pred_boxes = self.box_predictor(image_feats, feature_map)
 File "/usr/local/lib/python3.10/dist-packages/transformers/models/owlv2/modeling_owlv2.py", line 1374, in box_predictor
 box_bias = self.box_bias.to(feature_map.device)



154.6884775797526


We got nearly 40 percent speed-up! You can also increase the batch size and see how much further speed-up you can get.

In [11]:
texts = [["a photo of a bee", "a photo of a bird"] for _ in range(8)]
images = [image for _ in range(8)]
inputs = processor(text=texts, images=image, return_tensors="pt").to("cuda")

In [12]:
starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
repetitions = 30
timings=np.zeros((repetitions,1))

for _ in range(10):
 _ = model(**inputs)

with torch.no_grad():
 for rep in range(repetitions):
 torch.cuda.synchronize()
 starter.record()
 output = model(**inputs)
 ender.record()
 torch.cuda.synchronize()
 curr_time = starter.elapsed_time(ender)
 timings[rep] = curr_time

mean_syn = np.sum(timings) / repetitions
print(mean_syn)

269.3023401896159


In [13]:
starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
timings=np.zeros((repetitions,1))

compiled_model = torch.compile(model, mode="reduce-overhead").to("cuda")

for _ in range(30):
 with torch.no_grad():
 _ = compiled_model(**inputs)


with torch.no_grad():
 for rep in range(repetitions):
 torch.cuda.synchronize()
 starter.record()
 output = compiled_model(**inputs)
 ender.record()
 torch.cuda.synchronize()
 curr_time = starter.elapsed_time(ender)
 timings[rep] = curr_time

mean_syn = np.sum(timings) / repetitions
print(mean_syn)

159.77137603759766
