{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "axYlcDTznci4" }, "source": [ "# Faster Foundation Models with `torch.compile`" ] }, { "cell_type": "markdown", "metadata": { "id": "B-yw8KMWsjfY" }, "source": [ "## Introduction to `torch.compile()`" ] }, { "cell_type": "markdown", "metadata": { "id": "AmmT4aDnqgOB" }, "source": [ "This guide benchmarks the inference speed-ups that `torch.compile()` brings to foundation models in 🤗 Transformers, with no reduction in model performance.\n", "\n", "The most commonly used `torch.compile` modes are the following:\n", "\n", "- \"default\" is the default mode, which offers a good balance between performance and overhead.\n", "\n", "- \"reduce-overhead\" reduces Python overhead by using CUDA graphs. It is useful for small batches but consumes a lot of memory, and it currently only works for CUDA-only graphs that do not mutate their inputs.\n", "\n", "If you have plenty of memory to spare, the best speed-up comes from `reduce-overhead`. How much speed-up you get depends on the model, so in this tutorial we will check the most widely used foundation models." ] }, { "cell_type": "markdown", "metadata": { "id": "5sCfbPTn7wBE" }, "source": [ "## OWLv2\n", "\n", "OWLv2 is a zero-shot object detection model released by Google Brain. We will load the base version." ] }, { "cell_type": "markdown", "metadata": { "id": "joeX3J315K0G" }, "source": [ "Let's load an example image, then the model and processor for OWLv2." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "Ztfcdqkul62z" }, "outputs": [], "source": [ "from PIL import Image\n", "import requests\n", "\n", "url = 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg'\n", "image = Image.open(requests.get(url, stream=True).raw)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "84npPHCQpHZ6", "outputId": "f30c41c7-b897-460d-d2a4-a1276bf2263e" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: \n", "The secret `HF_TOKEN` does not exist in your Colab secrets.\n", "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n", "You will be able to reuse this secret in all of your notebooks.\n", "Please note that authentication is recommended but still optional to access public models or datasets.\n", " warnings.warn(\n" ] } ], "source": [ "from transformers import AutoProcessor, Owlv2ForObjectDetection\n", "import torch\n", "import numpy as np\n", "\n", "processor = AutoProcessor.from_pretrained(\"google/owlv2-base-patch16-ensemble\")\n", "model = Owlv2ForObjectDetection.from_pretrained(\"google/owlv2-base-patch16-ensemble\").to(\"cuda\")\n", "\n", "texts = [[\"a photo of a bee\", \"a photo of a bird\"]]\n", "inputs = processor(text=texts, images=image, return_tensors=\"pt\").to(\"cuda\")" ] }, { "cell_type": "markdown", "metadata": { "id": "3AedkjLu5PRo" }, "source": [ "We can now get to benchmarking. We will benchmark the model itself and the compiled model." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "RQQSEgkQtXEV", "outputId": "8003590b-c4bc-4b3d-9b1b-dade853b8dd8" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "255.7331792195638\n" ] } ], "source": [ "starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)\n", "repetitions = 30\n", "timings = np.zeros((repetitions, 1))\n", "\n", "# Warm up the GPU before timing; no_grad avoids building autograd graphs\n", "with torch.no_grad():\n", "    for _ in range(10):\n", "        _ = model(**inputs)\n", "\n", "with torch.no_grad():\n", "    for rep in range(repetitions):\n", "        torch.cuda.synchronize()\n", "        starter.record()\n", "        output = model(**inputs)\n", "        ender.record()\n", "        torch.cuda.synchronize()\n", "        # elapsed_time returns milliseconds\n", "        curr_time = starter.elapsed_time(ender)\n", "        timings[rep] = curr_time\n", "\n", "# Mean latency in milliseconds\n", "mean_syn = np.sum(timings) / repetitions\n", "print(mean_syn)\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "bEZiNgaupOx6", "outputId": "e5d47875-1e40-4997-e533-94bf0ff34d14" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.\n", " self.pid = os.fork()\n", "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py:124: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.\n", " warnings.warn(\n", "skipping cudagraphs due to skipping cudagraphs due to cpu device. Found from : \n", " File \"/usr/local/lib/python3.10/dist-packages/transformers/models/owlv2/modeling_owlv2.py\", line 1711, in forward\n", " pred_boxes = self.box_predictor(image_feats, feature_map)\n", " File \"/usr/local/lib/python3.10/dist-packages/transformers/models/owlv2/modeling_owlv2.py\", line 1374, in box_predictor\n", " box_bias = self.box_bias.to(feature_map.device)\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "154.6884775797526\n" ] } ], "source": [ "starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)\n", "timings = np.zeros((repetitions, 1))\n", "\n", "compiled_model = torch.compile(model, mode=\"reduce-overhead\").to(\"cuda\")\n", "\n", "# Warm-up runs: the first calls trigger compilation and should not be timed\n", "with torch.no_grad():\n", "    for _ in range(30):\n", "        _ = compiled_model(**inputs)\n", "\n", "with torch.no_grad():\n", "    for rep in range(repetitions):\n", "        torch.cuda.synchronize()\n", "        starter.record()\n", "        output = compiled_model(**inputs)\n", "        ender.record()\n", "        torch.cuda.synchronize()\n", "        # elapsed_time returns milliseconds\n", "        curr_time = starter.elapsed_time(ender)\n", "        timings[rep] = curr_time\n", "\n", "# Mean latency in milliseconds\n", "mean_syn = np.sum(timings) / repetitions\n", "print(mean_syn)" ] }, { "cell_type": "markdown", "metadata": { "id": "d_0d7DwN6gBt" }, "source": [ "Mean latency dropped from about 256 ms to about 155 ms, a speed-up of nearly 40 percent! You can also increase the batch size and see how much further speed-up you can get." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": { "id": "exKoOptB61UL" }, "outputs": [], "source": [ "# Build a batch of 8 identical (text, image) pairs\n", "texts = [[\"a photo of a bee\", \"a photo of a bird\"] for _ in range(8)]\n", "images = [image for _ in range(8)]\n", "# Pass the batched `images` list, not the single `image`\n", "inputs = processor(text=texts, images=images, return_tensors=\"pt\").to(\"cuda\")" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "EFj9Pgra7Km8", "outputId": "5fefb8c0-9e86-478c-e9e2-0dbc0fa8a37b" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "269.3023401896159\n" ] } ], "source": [ "starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)\n", "repetitions = 30\n", "timings = np.zeros((repetitions, 1))\n", "\n", "# Warm up the GPU before timing; no_grad avoids building autograd graphs\n", "with torch.no_grad():\n", "    for _ in range(10):\n", "        _ = model(**inputs)\n", "\n", "with torch.no_grad():\n", "    for rep in range(repetitions):\n", "        torch.cuda.synchronize()\n", "        starter.record()\n", "        output = model(**inputs)\n", "        ender.record()\n", "        torch.cuda.synchronize()\n", "        # elapsed_time returns milliseconds\n", "        curr_time = starter.elapsed_time(ender)\n", "        timings[rep] = curr_time\n", "\n", "# Mean latency in milliseconds\n", "mean_syn = np.sum(timings) / repetitions\n", "print(mean_syn)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "OuQZmgTK7UCo", "outputId": "7184eb1d-b545-4bb6-b544-3effd5c2545a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "159.77137603759766\n" ] } ], "source": [ "starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)\n", "timings = np.zeros((repetitions, 1))\n", "\n", "compiled_model = torch.compile(model, mode=\"reduce-overhead\").to(\"cuda\")\n", "\n", "# Warm-up runs: the first calls trigger compilation and should not be timed\n", "with torch.no_grad():\n", "    for _ in range(30):\n", "        _ = compiled_model(**inputs)\n", "\n", "with torch.no_grad():\n", "    for rep in range(repetitions):\n", "        torch.cuda.synchronize()\n", "        starter.record()\n", "        output = compiled_model(**inputs)\n", "        ender.record()\n", "        torch.cuda.synchronize()\n", "        # elapsed_time returns milliseconds\n", "        curr_time = starter.elapsed_time(ender)\n", "        timings[rep] = curr_time\n", "\n", "# Mean latency in milliseconds\n", "mean_syn = np.sum(timings) / repetitions\n", "print(mean_syn)" ] } ], "metadata": { "accelerator": "GPU", "colab": { "gpuType": "L4", "machine_shape": "hm", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }