mervenoyan committed on
Commit
87b3d4b
·
1 Parent(s): 30f4531
.DS_Store ADDED
Binary file (6.15 kB). View file
 
ColPali_+_Qwen2_VL.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
Faster_Zero_shot_Object_Detection_with_Optimum.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
Faster_foundation_models_with_torch_compile.ipynb ADDED
@@ -0,0 +1,344 @@
+ {
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "provenance": [],
+ "machine_shape": "hm",
+ "gpuType": "L4"
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ },
+ "accelerator": "GPU"
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Faster Foundation Models with `torch.compile`"
+ ],
+ "metadata": {
+ "id": "axYlcDTznci4"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Introduction to `torch.compile()`"
+ ],
+ "metadata": {
+ "id": "B-yw8KMWsjfY"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "This guide benchmarks the inference speed-ups that `torch.compile()` introduces, with no reduction in model performance, for foundation models in 🤗 Transformers.\n",
+ "\n",
+ "The most commonly used `torch.compile` modes are the following:\n",
+ "\n",
+ "- \"default\" is the default mode, which strikes a good balance between performance and overhead.\n",
+ "\n",
+ "- \"reduce-overhead\" reduces Python overhead using CUDA graphs; it is useful for small batches but consumes a lot of memory. As of now it only works for CUDA-only graphs that do not mutate inputs.\n",
+ "\n",
+ "If you have plenty of memory, the best speed-up comes from `reduce-overhead`. How much speed-up you get depends on the model, so in this tutorial we will check the most used foundation models."
+ ],
+ "metadata": {
+ "id": "AmmT4aDnqgOB"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## OWLv2\n",
+ "\n",
+ "OWLv2 is a zero-shot object detection model released by Google Brain. We will load the base version."
+ ],
+ "metadata": {
+ "id": "5sCfbPTn7wBE"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Let's load the model and processor for OWLv2."
+ ],
+ "metadata": {
+ "id": "joeX3J315K0G"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from PIL import Image\n",
+ "import requests\n",
+ "\n",
+ "url = 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg'\n",
+ "image = Image.open(requests.get(url, stream=True).raw)"
+ ],
+ "metadata": {
+ "id": "Ztfcdqkul62z"
+ },
+ "execution_count": 1,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from transformers import AutoProcessor, Owlv2ForObjectDetection\n",
+ "import torch\n",
+ "import numpy as np\n",
+ "\n",
+ "processor = AutoProcessor.from_pretrained(\"google/owlv2-base-patch16-ensemble\")\n",
+ "model = Owlv2ForObjectDetection.from_pretrained(\"google/owlv2-base-patch16-ensemble\").to(\"cuda\")\n",
+ "\n",
+ "texts = [[\"a photo of a bee\", \"a photo of a bird\"]]\n",
+ "inputs = processor(text=texts, images=image, return_tensors=\"pt\").to(\"cuda\")"
+ ],
+ "metadata": {
+ "id": "84npPHCQpHZ6",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "f30c41c7-b897-460d-d2a4-a1276bf2263e"
+ },
+ "execution_count": 2,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: \n",
+ "The secret `HF_TOKEN` does not exist in your Colab secrets.\n",
+ "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n",
+ "You will be able to reuse this secret in all of your notebooks.\n",
+ "Please note that authentication is recommended but still optional to access public models or datasets.\n",
+ " warnings.warn(\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We can now get to benchmarking. We will benchmark the model itself and the compiled model."
+ ],
+ "metadata": {
+ "id": "3AedkjLu5PRo"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)\n",
+ "repetitions = 30\n",
+ "timings = np.zeros((repetitions, 1))\n",
+ "\n",
+ "for _ in range(10):\n",
+ "    _ = model(**inputs)\n",
+ "\n",
+ "with torch.no_grad():\n",
+ "    for rep in range(repetitions):\n",
+ "        torch.cuda.synchronize()\n",
+ "        starter.record()\n",
+ "        output = model(**inputs)\n",
+ "        ender.record()\n",
+ "        torch.cuda.synchronize()\n",
+ "        curr_time = starter.elapsed_time(ender)\n",
+ "        timings[rep] = curr_time\n",
+ "\n",
+ "mean_syn = np.sum(timings) / repetitions\n",
+ "print(mean_syn)\n"
+ ],
+ "metadata": {
+ "id": "RQQSEgkQtXEV",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "8003590b-c4bc-4b3d-9b1b-dade853b8dd8"
+ },
+ "execution_count": 3,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "255.7331792195638\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)\n",
+ "timings = np.zeros((repetitions, 1))\n",
+ "\n",
+ "compiled_model = torch.compile(model, mode=\"reduce-overhead\").to(\"cuda\")\n",
+ "\n",
+ "for _ in range(30):\n",
+ "    with torch.no_grad():\n",
+ "        _ = compiled_model(**inputs)\n",
+ "\n",
+ "\n",
+ "with torch.no_grad():\n",
+ "    for rep in range(repetitions):\n",
+ "        torch.cuda.synchronize()\n",
+ "        starter.record()\n",
+ "        output = compiled_model(**inputs)\n",
+ "        ender.record()\n",
+ "        torch.cuda.synchronize()\n",
+ "        curr_time = starter.elapsed_time(ender)\n",
+ "        timings[rep] = curr_time\n",
+ "\n",
+ "mean_syn = np.sum(timings) / repetitions\n",
+ "print(mean_syn)"
+ ],
+ "metadata": {
+ "id": "bEZiNgaupOx6",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "outputId": "e5d47875-1e40-4997-e533-94bf0ff34d14"
+ },
+ "execution_count": 4,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.\n",
+ "  self.pid = os.fork()\n",
+ "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py:124: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.\n",
+ "  warnings.warn(\n",
+ "skipping cudagraphs due to skipping cudagraphs due to cpu device. Found from : \n",
+ "   File \"/usr/local/lib/python3.10/dist-packages/transformers/models/owlv2/modeling_owlv2.py\", line 1711, in forward\n",
+ "    pred_boxes = self.box_predictor(image_feats, feature_map)\n",
+ "  File \"/usr/local/lib/python3.10/dist-packages/transformers/models/owlv2/modeling_owlv2.py\", line 1374, in box_predictor\n",
+ "    box_bias = self.box_bias.to(feature_map.device)\n",
+ "\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "154.6884775797526\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We got nearly a 40 percent speed-up! You can also increase the batch size and see how much further speed-up you can get."
+ ],
+ "metadata": {
+ "id": "d_0d7DwN6gBt"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "texts = [[\"a photo of a bee\", \"a photo of a bird\"] for _ in range(8)]\n",
+ "images = [image for _ in range(8)]\n",
+ "inputs = processor(text=texts, images=images, return_tensors=\"pt\").to(\"cuda\")"
+ ],
+ "metadata": {
+ "id": "exKoOptB61UL"
+ },
+ "execution_count": 11,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)\n",
+ "repetitions = 30\n",
+ "timings = np.zeros((repetitions, 1))\n",
+ "\n",
+ "for _ in range(10):\n",
+ "    _ = model(**inputs)\n",
+ "\n",
+ "with torch.no_grad():\n",
+ "    for rep in range(repetitions):\n",
+ "        torch.cuda.synchronize()\n",
+ "        starter.record()\n",
+ "        output = model(**inputs)\n",
+ "        ender.record()\n",
+ "        torch.cuda.synchronize()\n",
+ "        curr_time = starter.elapsed_time(ender)\n",
+ "        timings[rep] = curr_time\n",
+ "\n",
+ "mean_syn = np.sum(timings) / repetitions\n",
+ "print(mean_syn)"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "EFj9Pgra7Km8",
+ "outputId": "5fefb8c0-9e86-478c-e9e2-0dbc0fa8a37b"
+ },
+ "execution_count": 12,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "269.3023401896159\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)\n",
+ "timings = np.zeros((repetitions, 1))\n",
+ "\n",
+ "compiled_model = torch.compile(model, mode=\"reduce-overhead\").to(\"cuda\")\n",
+ "\n",
+ "for _ in range(30):\n",
+ "    with torch.no_grad():\n",
+ "        _ = compiled_model(**inputs)\n",
+ "\n",
+ "\n",
+ "with torch.no_grad():\n",
+ "    for rep in range(repetitions):\n",
+ "        torch.cuda.synchronize()\n",
+ "        starter.record()\n",
+ "        output = compiled_model(**inputs)\n",
+ "        ender.record()\n",
+ "        torch.cuda.synchronize()\n",
+ "        curr_time = starter.elapsed_time(ender)\n",
+ "        timings[rep] = curr_time\n",
+ "\n",
+ "mean_syn = np.sum(timings) / repetitions\n",
+ "print(mean_syn)"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "OuQZmgTK7UCo",
+ "outputId": "7184eb1d-b545-4bb6-b544-3effd5c2545a"
+ },
+ "execution_count": 13,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "159.77137603759766\n"
+ ]
+ }
+ ]
+ }
+ ]
+ }
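
The notebook above times the model with paired `torch.cuda.Event` records after a warm-up phase. The same warm-up-then-measure pattern can be sketched framework-free; the `benchmark` helper below is illustrative and not part of the notebook:

```python
import time

def benchmark(fn, warmup=10, repetitions=30):
    """Run fn a few times to warm up (e.g. to absorb torch.compile's
    first-call compilation cost), then return the mean latency in
    milliseconds over `repetitions` timed runs."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return sum(timings) / repetitions

# Usage sketch: mean_ms = benchmark(lambda: model(**inputs))
```

On GPU, the notebook uses CUDA events with `torch.cuda.synchronize()` around each call instead of wall-clock timing, since CUDA kernels launch asynchronously and an unsynchronized host-side timer would under-measure them.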
Fine_tune_Florence_2.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
Fine_tune_PaliGemma.ipynb ADDED
@@ -0,0 +1,1846 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {
6
+ "id": "view-in-github",
7
+ "colab_type": "text"
8
+ },
9
+ "source": [
10
+ "<a href=\"https://colab.research.google.com/github/merveenoyan/smol-vision/blob/main/Fine_tune_PaliGemma.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
11
+ ]
12
+ },
13
+ {
14
+ "cell_type": "markdown",
15
+ "metadata": {
16
+ "id": "m8t6tkjuuONX"
17
+ },
18
+ "source": [
19
+ "## PaliGemma Fine-tuning\n",
20
+ "\n",
21
+ "In this notebook, we will fine-tune [pretrained PaliGemma](https://huggingface.co/google/paligemma2-3b-pt-448) on a small split of [VQAv2](https://huggingface.co/datasets/HuggingFaceM4/VQAv2) dataset. Let's get started by installing necessary libraries."
22
+ ]
23
+ },
24
+ {
25
+ "cell_type": "code",
26
+ "source": [
27
+ "!pip install -q -U datasets bitsandbytes peft git+https://github.com/huggingface/transformers.git"
28
+ ],
29
+ "metadata": {
30
+ "id": "EB0gv8OzHfLV",
31
+ "colab": {
32
+ "base_uri": "https://localhost:8080/"
33
+ },
34
+ "outputId": "9de07e75-ddf4-4347-fc41-432a23774e2c"
35
+ },
36
+ "execution_count": 1,
37
+ "outputs": [
38
+ {
39
+ "output_type": "stream",
40
+ "name": "stdout",
41
+ "text": [
42
+ " Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n",
43
+ " Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n",
44
+ " Preparing metadata (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
45
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m480.6/480.6 kB\u001b[0m \u001b[31m25.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
46
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m69.1/69.1 MB\u001b[0m \u001b[31m28.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
47
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m116.3/116.3 kB\u001b[0m \u001b[31m7.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
48
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m179.3/179.3 kB\u001b[0m \u001b[31m14.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
49
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m134.8/134.8 kB\u001b[0m \u001b[31m8.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
50
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.0/3.0 MB\u001b[0m \u001b[31m75.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
51
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m194.1/194.1 kB\u001b[0m \u001b[31m17.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
52
+ "\u001b[?25h Building wheel for transformers (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
53
+ "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
54
+ "gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.9.0 which is incompatible.\u001b[0m\u001b[31m\n",
55
+ "\u001b[0m"
56
+ ]
57
+ }
58
+ ]
59
+ },
60
+ {
61
+ "cell_type": "markdown",
62
+ "metadata": {
63
+ "id": "q_85okyYt1eo"
64
+ },
65
+ "source": [
66
+ "We will authenticate to access the model using `notebook_login()`."
67
+ ]
68
+ },
69
+ {
70
+ "cell_type": "code",
71
+ "execution_count": 2,
72
+ "metadata": {
73
+ "id": "NzJZSHD8tZZy",
74
+ "colab": {
75
+ "base_uri": "https://localhost:8080/",
76
+ "height": 17,
77
+ "referenced_widgets": [
78
+ "4f0e85aa740146d3aca81588a0288031",
79
+ "c7fcb9dd46e649c4b8bd967b69bdb867",
80
+ "c3fad0f1cb954317a20ee158f7e10363",
81
+ "3deca9286f89422aa691325b39347b0b",
82
+ "ca1c290bfb654f1190bbde68d51167f1",
83
+ "2d8493a60b7a42c1b25ec0bbe0a59043",
84
+ "c25efe32ee7c40d3a4c95093abb2a720",
85
+ "55c01e2c04d1499ca5b9b19dea7e4e02",
86
+ "bf9da831d7ad4651a262c5e7f80bbf87",
87
+ "ed2d3d1a700143d2a48e9a9b13bd1200",
88
+ "40782cfc43a8437da5534feee03c6ba6",
89
+ "b6fac3155dd140bc8e1b010270bc3cc2",
90
+ "ca348c721475417582ed5018ed43151f",
91
+ "3f07afac7c194db7a16167d177562a46",
92
+ "5515d96f0c8947f0ad4b7f17eb7d63f6",
93
+ "d703de12cf9d4f87aa6ec2cc52f1090a",
94
+ "757bc788bd6842d28a9f889187ffb88e",
95
+ "65f10d2456cb4ee1963fac050e4c34f7",
96
+ "9335e48fe8ba4fe9b535b5ece1be6ff5",
97
+ "80df5f3cd6c646808b09d99daed5bfd2"
98
+ ]
99
+ },
100
+ "outputId": "c01b2b6f-3c1e-45da-9fc0-f4f518bcca24"
101
+ },
102
+ "outputs": [
103
+ {
104
+ "output_type": "display_data",
105
+ "data": {
106
+ "text/plain": [
107
+ "VBox(children=(HTML(value='<center> <img\\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…"
108
+ ],
109
+ "application/vnd.jupyter.widget-view+json": {
110
+ "version_major": 2,
111
+ "version_minor": 0,
112
+ "model_id": "4f0e85aa740146d3aca81588a0288031"
113
+ }
114
+ },
115
+ "metadata": {}
116
+ }
117
+ ],
118
+ "source": [
119
+ "from huggingface_hub import notebook_login\n",
120
+ "notebook_login()"
121
+ ]
122
+ },
123
+ {
124
+ "cell_type": "markdown",
125
+ "metadata": {
126
+ "id": "9_jUBDTEuw1j"
127
+ },
128
+ "source": [
129
+ "Let's load the dataset."
130
+ ]
131
+ },
132
+ {
133
+ "cell_type": "code",
134
+ "execution_count": 1,
135
+ "metadata": {
136
+ "id": "az5kdSbNpjgH",
137
+ "colab": {
138
+ "base_uri": "https://localhost:8080/"
139
+ },
140
+ "outputId": "2d9f379c-eb31-45b0-b84c-79c2a2577d01"
141
+ },
142
+ "outputs": [
143
+ {
144
+ "output_type": "stream",
145
+ "name": "stderr",
146
+ "text": [
147
+ "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning: \n",
148
+ "The secret `HF_TOKEN` does not exist in your Colab secrets.\n",
149
+ "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n",
150
+ "You will be able to reuse this secret in all of your notebooks.\n",
151
+ "Please note that authentication is recommended but still optional to access public models or datasets.\n",
152
+ " warnings.warn(\n"
153
+ ]
154
+ }
155
+ ],
156
+ "source": [
157
+ "from datasets import load_dataset\n",
158
+ "ds = load_dataset('merve/vqav2-small', split=\"validation\")\n"
159
+ ]
160
+ },
161
+ {
162
+ "cell_type": "code",
163
+ "execution_count": 2,
164
+ "metadata": {
165
+ "id": "wN1c9Aqhqt47"
166
+ },
167
+ "outputs": [],
168
+ "source": [
169
+ "split_ds = ds.train_test_split(test_size=0.9) # we'll use a very small split for demo\n",
170
+ "train_ds = split_ds[\"test\"]"
171
+ ]
172
+ },
173
+ {
174
+ "cell_type": "code",
175
+ "execution_count": 3,
176
+ "metadata": {
177
+ "id": "TNJW2ty4yy4L",
178
+ "colab": {
179
+ "base_uri": "https://localhost:8080/"
180
+ },
181
+ "outputId": "f76414b2-8f37-48ae-d369-b977323fa892"
182
+ },
183
+ "outputs": [
184
+ {
185
+ "output_type": "execute_result",
186
+ "data": {
187
+ "text/plain": [
188
+ "Dataset({\n",
189
+ " features: ['multiple_choice_answer', 'question', 'image'],\n",
190
+ " num_rows: 19292\n",
191
+ "})"
192
+ ]
193
+ },
194
+ "metadata": {},
195
+ "execution_count": 3
196
+ }
197
+ ],
198
+ "source": [
199
+ "train_ds"
200
+ ]
201
+ },
202
+ {
203
+ "cell_type": "markdown",
204
+ "metadata": {
205
+ "id": "Hi_Y1blXwA04"
206
+ },
207
+ "source": [
208
+ "Our dataset is a very general one and similar to many datasets that PaliGemma was trained with. In this case, we do not need to fine-tune the image encoder, the multimodal projector but we will only fine-tune the text decoder."
209
+ ]
210
+ },
211
+ {
212
+ "cell_type": "code",
213
+ "execution_count": 4,
214
+ "metadata": {
215
+ "id": "Zya_PWM3uBWs"
216
+ },
217
+ "outputs": [],
218
+ "source": [
219
+ "from transformers import PaliGemmaProcessor\n",
220
+ "model_id =\"google/paligemma2-3b-pt-224\" # or your favorite PaliGemma"
221
+ ]
222
+ },
223
+ {
224
+ "cell_type": "code",
225
+ "execution_count": 5,
226
+ "metadata": {
227
+ "id": "iZRvrfUquH1y",
228
+ "colab": {
229
+ "base_uri": "https://localhost:8080/",
230
+ "height": 49,
231
+ "referenced_widgets": [
232
+ "8458933373264dbeb58d0b5ace4fd9c6",
233
+ "714009484da745dc8a87e5066b939de2",
234
+ "e43e970ce8ba477e83081a4c7fea05f5",
235
+ "7138aa9537fc4b4f809e57665be87139",
236
+ "46810cc7c7c54e31a65e609c386d86d9",
237
+ "cfed7deef0b74f4b9d160e9fdc2b138e",
238
+ "23ddab24ac304751b3babfaeec9360eb",
239
+ "79e87175ffb949bd8cddf4577210a42d",
240
+ "5aed84a20ac34f2b943d26d66decc88f",
241
+ "3ca0e1427ac6477c9921929af7ff00d1",
242
+ "a9a5503caf384b93bf987e5271a577d2"
243
+ ]
244
+ },
245
+ "outputId": "34f12289-6ef4-49d9-9257-ad0328961190"
246
+ },
247
+ "outputs": [
248
+ {
249
+ "output_type": "display_data",
250
+ "data": {
251
+ "text/plain": [
252
+ "Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]"
253
+ ],
254
+ "application/vnd.jupyter.widget-view+json": {
255
+ "version_major": 2,
256
+ "version_minor": 0,
257
+ "model_id": "8458933373264dbeb58d0b5ace4fd9c6"
258
+ }
259
+ },
260
+ "metadata": {}
261
+ }
262
+ ],
263
+ "source": [
264
+ "from transformers import PaliGemmaForConditionalGeneration\n",
265
+ "import torch\n",
266
+ "device = \"cuda\"\n",
267
+ "model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)\n",
268
+ "\n",
269
+ "for param in model.vision_tower.parameters():\n",
270
+ " param.requires_grad = False\n",
271
+ "\n",
272
+ "for param in model.multi_modal_projector.parameters():\n",
273
+ " param.requires_grad = False\n"
274
+ ]
275
+ },
276
+ {
277
+ "cell_type": "markdown",
278
+ "metadata": {
279
+ "id": "uCiVI-xUwSJm"
280
+ },
281
+ "source": [
282
+ "Alternatively, if you want to do LoRA and QLoRA fine-tuning, you can run below cells to load the adapter either in full precision or quantized."
283
+ ]
284
+ },
285
+ {
286
+ "cell_type": "code",
287
+ "execution_count": 6,
288
+ "metadata": {
289
+ "id": "9AYeuyzNuJ9X",
290
+ "colab": {
291
+ "base_uri": "https://localhost:8080/",
292
+ "height": 66,
293
+ "referenced_widgets": [
294
+ "c68f0fe7a6bb4060afcb05e3f6422288",
295
+ "fef3c94897fc4ffa86f91aac7a45ac7f",
296
+ "92881d2e3f1a438b92a389cc6022f7ad",
297
+ "f518ab021bc648f188638fd168879edd",
298
+ "1a29c71234d74f08b2645f9383fee126",
299
+ "f8553ec713ea440eb0208a1012547988",
300
+ "25e0373512b747ba8ebe020b8b8ab932",
301
+ "daff4ba27c68441395aa5377111f30f1",
302
+ "863090b3318e4e0186bd46d3d1479de4",
303
+ "acae1751ff5d4293bb588c2d9c7ab851",
304
+ "8859eb8d9c154cb79a302db1568768fa"
305
+ ]
306
+ },
307
+ "outputId": "aaedd707-f694-4ba8-ba43-7ae2a3739e73"
308
+ },
309
+ "outputs": [
310
+ {
311
+ "output_type": "display_data",
312
+ "data": {
313
+ "text/plain": [
314
+ "Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]"
315
+ ],
316
+ "application/vnd.jupyter.widget-view+json": {
317
+ "version_major": 2,
318
+ "version_minor": 0,
319
+ "model_id": "c68f0fe7a6bb4060afcb05e3f6422288"
320
+ }
321
+ },
322
+ "metadata": {}
323
+ },
324
+ {
325
+ "output_type": "stream",
326
+ "name": "stdout",
327
+ "text": [
328
+ "trainable params: 11,876,352 || all params: 3,044,118,768 || trainable%: 0.3901\n"
329
+ ]
330
+ }
331
+ ],
332
+ "source": [
333
+ "from transformers import BitsAndBytesConfig, PaliGemmaForConditionalGeneration\n",
334
+ "from peft import get_peft_model, LoraConfig\n",
335
+ "\n",
336
+ "bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)\n",
337
+ "\n",
338
+ "lora_config = LoraConfig(\n",
339
+ " r=8,\n",
340
+ " target_modules=[\"q_proj\", \"o_proj\", \"k_proj\", \"v_proj\", \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
341
+ " task_type=\"CAUSAL_LM\",\n",
342
+ ")\n",
343
+ "\n",
344
+ "model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, device_map=\"auto\")#, quantization_config=bnb_config)\n",
345
+ "model = get_peft_model(model, lora_config)\n",
346
+ "model.print_trainable_parameters()\n",
347
+ "#trainable params: 11,298,816 || all params: 2,934,634,224 || trainable%: 0.38501616002417344\n"
348
+ ]
349
+ },
350
+ {
351
+ "cell_type": "markdown",
352
+ "source": [
353
+ "We need to take tokens to same dtype as model so need to store it as a variable."
354
+ ],
355
+ "metadata": {
356
+ "id": "sfxtN1iKRWXX"
357
+ }
358
+ },
359
+ {
360
+ "cell_type": "code",
361
+ "source": [
362
+ "DTYPE = model.dtype"
363
+ ],
364
+ "metadata": {
365
+ "id": "uGZ6FnioRWEc"
366
+ },
367
+ "execution_count": 7,
368
+ "outputs": []
369
+ },
370
+ {
371
+ "cell_type": "markdown",
372
+ "metadata": {
373
+ "id": "OsquATWQu2lJ"
374
+ },
375
+ "source": [
376
+ "Load the processor to preprocess the dataset."
377
+ ]
378
+ },
379
+ {
380
+ "cell_type": "code",
381
+ "source": [
382
+ "processor = PaliGemmaProcessor.from_pretrained(model_id)"
383
+ ],
384
+ "metadata": {
385
+ "id": "wQ_gbnXARKz1"
386
+ },
387
+ "execution_count": 8,
388
+ "outputs": []
389
+ },
390
+ {
391
+ "cell_type": "markdown",
392
+ "metadata": {
393
+ "id": "QZROnV-pu7rt"
394
+ },
395
+ "source": [
396
+ "We will preprocess our examples. We need to prepare a prompt template and pass the text input inside, pass it with batches of images to processor. Then we will set the pad tokens and image tokens to -100 to let the model ignore them. We will pass our preprocessed input as labels to make the model learn how to generate responses."
397
+ ]
398
+ },
399
+ {
400
+ "cell_type": "code",
401
+ "execution_count": 9,
402
+ "metadata": {
403
+ "id": "hdw3uBcNuGmw"
404
+ },
405
+ "outputs": [],
406
+ "source": [
407
+ "import torch\n",
408
+ "\n",
409
+ "image_token = processor.tokenizer.convert_tokens_to_ids(\"<image>\")\n",
410
+ "def collate_fn(examples):\n",
411
+ " texts = [\"<image>answer en \" + example[\"question\"] for example in examples]\n",
412
+ " labels= [example['multiple_choice_answer'] for example in examples]\n",
413
+ " images = [example[\"image\"].convert(\"RGB\") for example in examples]\n",
414
+ " tokens = processor(text=texts, images=images, suffix=labels,\n",
415
+ " return_tensors=\"pt\", padding=\"longest\")\n",
416
+ "\n",
417
+ " tokens = tokens.to(DTYPE).to(device)\n",
418
+ " return tokens\n"
419
+ ]
420
+ },
421
+ {
422
+ "cell_type": "markdown",
423
+ "metadata": {
424
+ "id": "logv0oLqwbIe"
425
+ },
426
+ "source": [
427
+ "We will now initialize the `TrainingArguments`."
428
+ ]
429
+ },
430
+ {
431
+ "cell_type": "code",
432
+ "execution_count": 13,
433
+ "metadata": {
434
+ "id": "Il7zKQO9uMPT"
435
+ },
436
+ "outputs": [],
437
+ "source": [
438
+ "from transformers import TrainingArguments\n",
439
+ "args=TrainingArguments(\n",
440
+ " num_train_epochs=2,\n",
441
+ " remove_unused_columns=False,\n",
442
+ " per_device_train_batch_size=1,\n",
443
+ " gradient_accumulation_steps=4,\n",
444
+ " warmup_steps=2,\n",
445
+ " learning_rate=2e-5,\n",
446
+ " weight_decay=1e-6,\n",
447
+ " adam_beta2=0.999,\n",
448
+ " logging_steps=100,\n",
449
+ " optim=\"adamw_hf\", # you can use paged optimizers like paged_adamw_8bit for QLoRA\n",
450
+ " save_strategy=\"steps\",\n",
451
+ " save_steps=1000,\n",
452
+ " save_total_limit=1,\n",
453
+ " output_dir=\"paligemma_vqav2\",\n",
454
+ " bf16=True,\n",
455
+ " report_to=[\"tensorboard\"],\n",
456
+ " dataloader_pin_memory=False\n",
457
+ " )\n"
458
+ ]
459
+ },
460
+ {
461
+ "cell_type": "markdown",
462
+ "metadata": {
463
+ "id": "8pR0EaGlwrDp"
464
+ },
465
+ "source": [
466
+ "We can now start training."
467
+ ]
468
+ },
469
+ {
470
+ "cell_type": "code",
471
+ "execution_count": 14,
472
+ "metadata": {
473
+ "id": "CguCGDv1uNkF"
474
+ },
475
+ "outputs": [],
476
+ "source": [
477
+ "from transformers import Trainer\n",
478
+ "\n",
479
+ "trainer = Trainer(\n",
480
+ " model=model,\n",
481
+ " train_dataset=train_ds ,\n",
482
+ " data_collator=collate_fn,\n",
483
+ " args=args\n",
484
+ " )\n"
485
+ ]
486
+ },
487
+ {
488
+ "cell_type": "markdown",
489
+ "source": [
490
+ "LoRA with bsz of 2 works on A100 Colab. You can apply gradient accumulation (which is enabled in this notebook) to simulate larger batch sizes.\n",
491
+ "Currently there's an issue with QLoRA, we are investigating and will solve soon."
492
+ ],
493
+ "metadata": {
494
+ "id": "ZX912_liP-Eh"
495
+ }
496
+ },
497
+ {
498
+ "cell_type": "code",
499
+ "execution_count": null,
500
+ "metadata": {
501
+ "id": "9KFPQLrnF2Ha"
502
+ },
503
+ "outputs": [],
504
+ "source": [
505
+ "trainer.train()"
506
+ ]
507
+ },
508
+ {
509
+ "cell_type": "code",
510
+ "execution_count": null,
511
+ "metadata": {
512
+ "id": "O9fMDEjXSSzF"
513
+ },
514
+ "outputs": [],
515
+ "source": [
516
+ "trainer.push_to_hub()"
517
+ ]
518
+ },
519
+ {
520
+ "cell_type": "markdown",
521
+ "metadata": {
522
+ "id": "JohfxEJQjLBd"
523
+ },
524
+ "source": [
525
+ "You can find the steps for inference [here](https://colab.research.google.com/drive/100IQcvMvGm9y--oelbLfI__eHCoz5Ser?usp=sharing)."
526
+ ]
527
+ }
528
+ ],
529
+ "metadata": {
530
+ "accelerator": "GPU",
531
+ "colab": {
532
+ "gpuType": "A100",
533
+ "provenance": [],
534
+ "include_colab_link": true
535
+ },
536
+ "kernelspec": {
537
+ "display_name": "Python 3",
538
+ "name": "python3"
539
+ },
540
+ "language_info": {
541
+ "codemirror_mode": {
542
+ "name": "ipython",
543
+ "version": 3
544
+ },
545
+ "file_extension": ".py",
546
+ "mimetype": "text/x-python",
547
+ "name": "python",
548
+ "nbconvert_exporter": "python",
549
+ "pygments_lexer": "ipython3",
550
+ "version": "3.11.3"
551
+ },
552
+ "widgets": {
553
+ "application/vnd.jupyter.widget-state+json": {
554
+ "4f0e85aa740146d3aca81588a0288031": {
555
+ "model_module": "@jupyter-widgets/controls",
556
+ "model_name": "VBoxModel",
557
+ "model_module_version": "1.5.0",
558
+ "state": {
559
+ "_dom_classes": [],
560
+ "_model_module": "@jupyter-widgets/controls",
561
+ "_model_module_version": "1.5.0",
562
+ "_model_name": "VBoxModel",
563
+ "_view_count": null,
564
+ "_view_module": "@jupyter-widgets/controls",
565
+ "_view_module_version": "1.5.0",
566
+ "_view_name": "VBoxView",
567
+ "box_style": "",
568
+ "children": [],
569
+ "layout": "IPY_MODEL_c25efe32ee7c40d3a4c95093abb2a720"
570
+ }
571
+ },
572
+ "c7fcb9dd46e649c4b8bd967b69bdb867": {
573
+ "model_module": "@jupyter-widgets/controls",
574
+ "model_name": "HTMLModel",
575
+ "model_module_version": "1.5.0",
576
+ "state": {
577
+ "_dom_classes": [],
578
+ "_model_module": "@jupyter-widgets/controls",
579
+ "_model_module_version": "1.5.0",
580
+ "_model_name": "HTMLModel",
581
+ "_view_count": null,
582
+ "_view_module": "@jupyter-widgets/controls",
583
+ "_view_module_version": "1.5.0",
584
+ "_view_name": "HTMLView",
585
+ "description": "",
586
+ "description_tooltip": null,
587
+ "layout": "IPY_MODEL_55c01e2c04d1499ca5b9b19dea7e4e02",
588
+ "placeholder": "​",
589
+ "style": "IPY_MODEL_bf9da831d7ad4651a262c5e7f80bbf87",
590
+ "value": "<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.svg\nalt='Hugging Face'> <br> Copy a token from <a\nhref=\"https://huggingface.co/settings/tokens\" target=\"_blank\">your Hugging Face\ntokens page</a> and paste it below. <br> Immediately click login after copying\nyour token or it might be stored in plain text in this notebook file. </center>"
591
+ }
592
+ },
593
+ "c3fad0f1cb954317a20ee158f7e10363": {
594
+ "model_module": "@jupyter-widgets/controls",
595
+ "model_name": "PasswordModel",
596
+ "model_module_version": "1.5.0",
597
+ "state": {
598
+ "_dom_classes": [],
599
+ "_model_module": "@jupyter-widgets/controls",
600
+ "_model_module_version": "1.5.0",
601
+ "_model_name": "PasswordModel",
602
+ "_view_count": null,
603
+ "_view_module": "@jupyter-widgets/controls",
604
+ "_view_module_version": "1.5.0",
605
+ "_view_name": "PasswordView",
606
+ "continuous_update": true,
607
+ "description": "Token:",
608
+ "description_tooltip": null,
609
+ "disabled": false,
610
+ "layout": "IPY_MODEL_ed2d3d1a700143d2a48e9a9b13bd1200",
611
+ "placeholder": "​",
612
+ "style": "IPY_MODEL_40782cfc43a8437da5534feee03c6ba6",
613
+ "value": ""
614
+ }
615
+ },
616
+ "3deca9286f89422aa691325b39347b0b": {
617
+ "model_module": "@jupyter-widgets/controls",
618
+ "model_name": "CheckboxModel",
619
+ "model_module_version": "1.5.0",
620
+ "state": {
621
+ "_dom_classes": [],
622
+ "_model_module": "@jupyter-widgets/controls",
623
+ "_model_module_version": "1.5.0",
624
+ "_model_name": "CheckboxModel",
625
+ "_view_count": null,
626
+ "_view_module": "@jupyter-widgets/controls",
627
+ "_view_module_version": "1.5.0",
628
+ "_view_name": "CheckboxView",
629
+ "description": "Add token as git credential?",
630
+ "description_tooltip": null,
631
+ "disabled": false,
632
+ "indent": true,
633
+ "layout": "IPY_MODEL_b6fac3155dd140bc8e1b010270bc3cc2",
634
+ "style": "IPY_MODEL_ca348c721475417582ed5018ed43151f",
635
+ "value": true
636
+ }
637
+ },
638
+ "ca1c290bfb654f1190bbde68d51167f1": {
639
+ "model_module": "@jupyter-widgets/controls",
640
+ "model_name": "ButtonModel",
641
+ "model_module_version": "1.5.0",
642
+ "state": {
643
+ "_dom_classes": [],
644
+ "_model_module": "@jupyter-widgets/controls",
645
+ "_model_module_version": "1.5.0",
646
+ "_model_name": "ButtonModel",
647
+ "_view_count": null,
648
+ "_view_module": "@jupyter-widgets/controls",
649
+ "_view_module_version": "1.5.0",
650
+ "_view_name": "ButtonView",
651
+ "button_style": "",
652
+ "description": "Login",
653
+ "disabled": false,
654
+ "icon": "",
655
+ "layout": "IPY_MODEL_3f07afac7c194db7a16167d177562a46",
656
+ "style": "IPY_MODEL_5515d96f0c8947f0ad4b7f17eb7d63f6",
657
+ "tooltip": ""
658
+ }
659
+ },
660
+ "2d8493a60b7a42c1b25ec0bbe0a59043": {
661
+ "model_module": "@jupyter-widgets/controls",
662
+ "model_name": "HTMLModel",
663
+ "model_module_version": "1.5.0",
664
+ "state": {
665
+ "_dom_classes": [],
666
+ "_model_module": "@jupyter-widgets/controls",
667
+ "_model_module_version": "1.5.0",
668
+ "_model_name": "HTMLModel",
669
+ "_view_count": null,
670
+ "_view_module": "@jupyter-widgets/controls",
671
+ "_view_module_version": "1.5.0",
672
+ "_view_name": "HTMLView",
673
+ "description": "",
674
+ "description_tooltip": null,
675
+ "layout": "IPY_MODEL_d703de12cf9d4f87aa6ec2cc52f1090a",
676
+ "placeholder": "​",
677
+ "style": "IPY_MODEL_757bc788bd6842d28a9f889187ffb88e",
678
+ "value": "\n<b>Pro Tip:</b> If you don't already have one, you can create a dedicated\n'notebooks' token with 'write' access, that you can then easily reuse for all\nnotebooks. </center>"
679
+ }
680
+ },
681
+ "c25efe32ee7c40d3a4c95093abb2a720": {
682
+ "model_module": "@jupyter-widgets/base",
683
+ "model_name": "LayoutModel",
684
+ "model_module_version": "1.2.0",
685
+ "state": {
686
+ "_model_module": "@jupyter-widgets/base",
687
+ "_model_module_version": "1.2.0",
688
+ "_model_name": "LayoutModel",
689
+ "_view_count": null,
690
+ "_view_module": "@jupyter-widgets/base",
691
+ "_view_module_version": "1.2.0",
692
+ "_view_name": "LayoutView",
693
+ "align_content": null,
694
+ "align_items": "center",
695
+ "align_self": null,
696
+ "border": null,
697
+ "bottom": null,
698
+ "display": "flex",
699
+ "flex": null,
700
+ "flex_flow": "column",
701
+ "grid_area": null,
702
+ "grid_auto_columns": null,
703
+ "grid_auto_flow": null,
704
+ "grid_auto_rows": null,
705
+ "grid_column": null,
706
+ "grid_gap": null,
707
+ "grid_row": null,
708
+ "grid_template_areas": null,
709
+ "grid_template_columns": null,
710
+ "grid_template_rows": null,
711
+ "height": null,
712
+ "justify_content": null,
713
+ "justify_items": null,
714
+ "left": null,
715
+ "margin": null,
716
+ "max_height": null,
717
+ "max_width": null,
718
+ "min_height": null,
719
+ "min_width": null,
720
+ "object_fit": null,
721
+ "object_position": null,
722
+ "order": null,
723
+ "overflow": null,
724
+ "overflow_x": null,
725
+ "overflow_y": null,
726
+ "padding": null,
727
+ "right": null,
728
+ "top": null,
729
+ "visibility": null,
730
+ "width": "50%"
731
+ }
732
+ },
733
+ "55c01e2c04d1499ca5b9b19dea7e4e02": {
734
+ "model_module": "@jupyter-widgets/base",
735
+ "model_name": "LayoutModel",
736
+ "model_module_version": "1.2.0",
737
+ "state": {
738
+ "_model_module": "@jupyter-widgets/base",
739
+ "_model_module_version": "1.2.0",
740
+ "_model_name": "LayoutModel",
741
+ "_view_count": null,
742
+ "_view_module": "@jupyter-widgets/base",
743
+ "_view_module_version": "1.2.0",
744
+ "_view_name": "LayoutView",
745
+ "align_content": null,
746
+ "align_items": null,
747
+ "align_self": null,
748
+ "border": null,
749
+ "bottom": null,
750
+ "display": null,
751
+ "flex": null,
752
+ "flex_flow": null,
753
+ "grid_area": null,
754
+ "grid_auto_columns": null,
755
+ "grid_auto_flow": null,
756
+ "grid_auto_rows": null,
757
+ "grid_column": null,
758
+ "grid_gap": null,
759
+ "grid_row": null,
760
+ "grid_template_areas": null,
761
+ "grid_template_columns": null,
762
+ "grid_template_rows": null,
763
+ "height": null,
764
+ "justify_content": null,
765
+ "justify_items": null,
766
+ "left": null,
767
+ "margin": null,
768
+ "max_height": null,
769
+ "max_width": null,
770
+ "min_height": null,
771
+ "min_width": null,
772
+ "object_fit": null,
773
+ "object_position": null,
774
+ "order": null,
775
+ "overflow": null,
776
+ "overflow_x": null,
777
+ "overflow_y": null,
778
+ "padding": null,
779
+ "right": null,
780
+ "top": null,
781
+ "visibility": null,
782
+ "width": null
783
+ }
784
+ },
785
+ "bf9da831d7ad4651a262c5e7f80bbf87": {
786
+ "model_module": "@jupyter-widgets/controls",
787
+ "model_name": "DescriptionStyleModel",
788
+ "model_module_version": "1.5.0",
789
+ "state": {
790
+ "_model_module": "@jupyter-widgets/controls",
791
+ "_model_module_version": "1.5.0",
792
+ "_model_name": "DescriptionStyleModel",
793
+ "_view_count": null,
794
+ "_view_module": "@jupyter-widgets/base",
795
+ "_view_module_version": "1.2.0",
796
+ "_view_name": "StyleView",
797
+ "description_width": ""
798
+ }
799
+ },
800
+ "ed2d3d1a700143d2a48e9a9b13bd1200": {
801
+ "model_module": "@jupyter-widgets/base",
802
+ "model_name": "LayoutModel",
803
+ "model_module_version": "1.2.0",
804
+ "state": {
805
+ "_model_module": "@jupyter-widgets/base",
806
+ "_model_module_version": "1.2.0",
807
+ "_model_name": "LayoutModel",
808
+ "_view_count": null,
809
+ "_view_module": "@jupyter-widgets/base",
810
+ "_view_module_version": "1.2.0",
811
+ "_view_name": "LayoutView",
812
+ "align_content": null,
813
+ "align_items": null,
814
+ "align_self": null,
815
+ "border": null,
816
+ "bottom": null,
817
+ "display": null,
818
+ "flex": null,
819
+ "flex_flow": null,
820
+ "grid_area": null,
821
+ "grid_auto_columns": null,
822
+ "grid_auto_flow": null,
823
+ "grid_auto_rows": null,
824
+ "grid_column": null,
825
+ "grid_gap": null,
826
+ "grid_row": null,
827
+ "grid_template_areas": null,
828
+ "grid_template_columns": null,
829
+ "grid_template_rows": null,
830
+ "height": null,
831
+ "justify_content": null,
832
+ "justify_items": null,
833
+ "left": null,
834
+ "margin": null,
835
+ "max_height": null,
836
+ "max_width": null,
837
+ "min_height": null,
838
+ "min_width": null,
839
+ "object_fit": null,
840
+ "object_position": null,
841
+ "order": null,
842
+ "overflow": null,
843
+ "overflow_x": null,
844
+ "overflow_y": null,
845
+ "padding": null,
846
+ "right": null,
847
+ "top": null,
848
+ "visibility": null,
849
+ "width": null
850
+ }
851
+ },
852
+ "40782cfc43a8437da5534feee03c6ba6": {
853
+ "model_module": "@jupyter-widgets/controls",
854
+ "model_name": "DescriptionStyleModel",
855
+ "model_module_version": "1.5.0",
856
+ "state": {
857
+ "_model_module": "@jupyter-widgets/controls",
858
+ "_model_module_version": "1.5.0",
859
+ "_model_name": "DescriptionStyleModel",
860
+ "_view_count": null,
861
+ "_view_module": "@jupyter-widgets/base",
862
+ "_view_module_version": "1.2.0",
863
+ "_view_name": "StyleView",
864
+ "description_width": ""
865
+ }
866
+ },
867
+ "b6fac3155dd140bc8e1b010270bc3cc2": {
868
+ "model_module": "@jupyter-widgets/base",
869
+ "model_name": "LayoutModel",
870
+ "model_module_version": "1.2.0",
871
+ "state": {
872
+ "_model_module": "@jupyter-widgets/base",
873
+ "_model_module_version": "1.2.0",
874
+ "_model_name": "LayoutModel",
875
+ "_view_count": null,
876
+ "_view_module": "@jupyter-widgets/base",
877
+ "_view_module_version": "1.2.0",
878
+ "_view_name": "LayoutView",
879
+ "align_content": null,
880
+ "align_items": null,
881
+ "align_self": null,
882
+ "border": null,
883
+ "bottom": null,
884
+ "display": null,
885
+ "flex": null,
886
+ "flex_flow": null,
887
+ "grid_area": null,
888
+ "grid_auto_columns": null,
889
+ "grid_auto_flow": null,
890
+ "grid_auto_rows": null,
891
+ "grid_column": null,
892
+ "grid_gap": null,
893
+ "grid_row": null,
894
+ "grid_template_areas": null,
895
+ "grid_template_columns": null,
896
+ "grid_template_rows": null,
897
+ "height": null,
898
+ "justify_content": null,
899
+ "justify_items": null,
900
+ "left": null,
901
+ "margin": null,
902
+ "max_height": null,
903
+ "max_width": null,
904
+ "min_height": null,
905
+ "min_width": null,
906
+ "object_fit": null,
907
+ "object_position": null,
908
+ "order": null,
909
+ "overflow": null,
910
+ "overflow_x": null,
911
+ "overflow_y": null,
912
+ "padding": null,
913
+ "right": null,
914
+ "top": null,
915
+ "visibility": null,
916
+ "width": null
917
+ }
918
+ },
919
+ "ca348c721475417582ed5018ed43151f": {
920
+ "model_module": "@jupyter-widgets/controls",
921
+ "model_name": "DescriptionStyleModel",
922
+ "model_module_version": "1.5.0",
923
+ "state": {
924
+ "_model_module": "@jupyter-widgets/controls",
925
+ "_model_module_version": "1.5.0",
926
+ "_model_name": "DescriptionStyleModel",
927
+ "_view_count": null,
928
+ "_view_module": "@jupyter-widgets/base",
929
+ "_view_module_version": "1.2.0",
930
+ "_view_name": "StyleView",
931
+ "description_width": ""
932
+ }
933
+ },
934
+ "3f07afac7c194db7a16167d177562a46": {
935
+ "model_module": "@jupyter-widgets/base",
936
+ "model_name": "LayoutModel",
937
+ "model_module_version": "1.2.0",
938
+ "state": {
939
+ "_model_module": "@jupyter-widgets/base",
940
+ "_model_module_version": "1.2.0",
941
+ "_model_name": "LayoutModel",
942
+ "_view_count": null,
943
+ "_view_module": "@jupyter-widgets/base",
944
+ "_view_module_version": "1.2.0",
945
+ "_view_name": "LayoutView",
946
+ "align_content": null,
947
+ "align_items": null,
948
+ "align_self": null,
949
+ "border": null,
950
+ "bottom": null,
951
+ "display": null,
952
+ "flex": null,
953
+ "flex_flow": null,
954
+ "grid_area": null,
955
+ "grid_auto_columns": null,
956
+ "grid_auto_flow": null,
957
+ "grid_auto_rows": null,
958
+ "grid_column": null,
959
+ "grid_gap": null,
960
+ "grid_row": null,
961
+ "grid_template_areas": null,
962
+ "grid_template_columns": null,
963
+ "grid_template_rows": null,
964
+ "height": null,
965
+ "justify_content": null,
966
+ "justify_items": null,
967
+ "left": null,
968
+ "margin": null,
969
+ "max_height": null,
970
+ "max_width": null,
971
+ "min_height": null,
972
+ "min_width": null,
973
+ "object_fit": null,
974
+ "object_position": null,
975
+ "order": null,
976
+ "overflow": null,
977
+ "overflow_x": null,
978
+ "overflow_y": null,
979
+ "padding": null,
980
+ "right": null,
981
+ "top": null,
982
+ "visibility": null,
983
+ "width": null
984
+ }
985
+ },
986
+ "5515d96f0c8947f0ad4b7f17eb7d63f6": {
987
+ "model_module": "@jupyter-widgets/controls",
988
+ "model_name": "ButtonStyleModel",
989
+ "model_module_version": "1.5.0",
990
+ "state": {
991
+ "_model_module": "@jupyter-widgets/controls",
992
+ "_model_module_version": "1.5.0",
993
+ "_model_name": "ButtonStyleModel",
994
+ "_view_count": null,
995
+ "_view_module": "@jupyter-widgets/base",
996
+ "_view_module_version": "1.2.0",
997
+ "_view_name": "StyleView",
998
+ "button_color": null,
999
+ "font_weight": ""
1000
+ }
1001
+ },
1002
+ "d703de12cf9d4f87aa6ec2cc52f1090a": {
1003
+ "model_module": "@jupyter-widgets/base",
1004
+ "model_name": "LayoutModel",
1005
+ "model_module_version": "1.2.0",
1006
+ "state": {
1007
+ "_model_module": "@jupyter-widgets/base",
1008
+ "_model_module_version": "1.2.0",
1009
+ "_model_name": "LayoutModel",
1010
+ "_view_count": null,
1011
+ "_view_module": "@jupyter-widgets/base",
1012
+ "_view_module_version": "1.2.0",
1013
+ "_view_name": "LayoutView",
1014
+ "align_content": null,
1015
+ "align_items": null,
1016
+ "align_self": null,
1017
+ "border": null,
1018
+ "bottom": null,
1019
+ "display": null,
1020
+ "flex": null,
1021
+ "flex_flow": null,
1022
+ "grid_area": null,
1023
+ "grid_auto_columns": null,
1024
+ "grid_auto_flow": null,
1025
+ "grid_auto_rows": null,
1026
+ "grid_column": null,
1027
+ "grid_gap": null,
1028
+ "grid_row": null,
1029
+ "grid_template_areas": null,
1030
+ "grid_template_columns": null,
1031
+ "grid_template_rows": null,
1032
+ "height": null,
1033
+ "justify_content": null,
1034
+ "justify_items": null,
1035
+ "left": null,
1036
+ "margin": null,
1037
+ "max_height": null,
1038
+ "max_width": null,
1039
+ "min_height": null,
1040
+ "min_width": null,
1041
+ "object_fit": null,
1042
+ "object_position": null,
1043
+ "order": null,
1044
+ "overflow": null,
1045
+ "overflow_x": null,
1046
+ "overflow_y": null,
1047
+ "padding": null,
1048
+ "right": null,
1049
+ "top": null,
1050
+ "visibility": null,
1051
+ "width": null
1052
+ }
1053
+ },
1054
+ "757bc788bd6842d28a9f889187ffb88e": {
1055
+ "model_module": "@jupyter-widgets/controls",
1056
+ "model_name": "DescriptionStyleModel",
1057
+ "model_module_version": "1.5.0",
1058
+ "state": {
1059
+ "_model_module": "@jupyter-widgets/controls",
1060
+ "_model_module_version": "1.5.0",
1061
+ "_model_name": "DescriptionStyleModel",
1062
+ "_view_count": null,
1063
+ "_view_module": "@jupyter-widgets/base",
1064
+ "_view_module_version": "1.2.0",
1065
+ "_view_name": "StyleView",
1066
+ "description_width": ""
1067
+ }
1068
+ },
1069
+ "65f10d2456cb4ee1963fac050e4c34f7": {
1070
+ "model_module": "@jupyter-widgets/controls",
1071
+ "model_name": "LabelModel",
1072
+ "model_module_version": "1.5.0",
1073
+ "state": {
1074
+ "_dom_classes": [],
1075
+ "_model_module": "@jupyter-widgets/controls",
1076
+ "_model_module_version": "1.5.0",
1077
+ "_model_name": "LabelModel",
1078
+ "_view_count": null,
1079
+ "_view_module": "@jupyter-widgets/controls",
1080
+ "_view_module_version": "1.5.0",
1081
+ "_view_name": "LabelView",
1082
+ "description": "",
1083
+ "description_tooltip": null,
1084
+ "layout": "IPY_MODEL_9335e48fe8ba4fe9b535b5ece1be6ff5",
1085
+ "placeholder": "​",
1086
+ "style": "IPY_MODEL_80df5f3cd6c646808b09d99daed5bfd2",
1087
+ "value": "Connecting..."
1088
+ }
1089
+ },
1090
+ "9335e48fe8ba4fe9b535b5ece1be6ff5": {
1091
+ "model_module": "@jupyter-widgets/base",
1092
+ "model_name": "LayoutModel",
1093
+ "model_module_version": "1.2.0",
1094
+ "state": {
1095
+ "_model_module": "@jupyter-widgets/base",
1096
+ "_model_module_version": "1.2.0",
1097
+ "_model_name": "LayoutModel",
1098
+ "_view_count": null,
1099
+ "_view_module": "@jupyter-widgets/base",
1100
+ "_view_module_version": "1.2.0",
1101
+ "_view_name": "LayoutView",
1102
+ "align_content": null,
1103
+ "align_items": null,
1104
+ "align_self": null,
1105
+ "border": null,
1106
+ "bottom": null,
1107
+ "display": null,
1108
+ "flex": null,
1109
+ "flex_flow": null,
1110
+ "grid_area": null,
1111
+ "grid_auto_columns": null,
1112
+ "grid_auto_flow": null,
1113
+ "grid_auto_rows": null,
1114
+ "grid_column": null,
1115
+ "grid_gap": null,
1116
+ "grid_row": null,
1117
+ "grid_template_areas": null,
1118
+ "grid_template_columns": null,
1119
+ "grid_template_rows": null,
1120
+ "height": null,
1121
+ "justify_content": null,
1122
+ "justify_items": null,
1123
+ "left": null,
1124
+ "margin": null,
1125
+ "max_height": null,
1126
+ "max_width": null,
1127
+ "min_height": null,
1128
+ "min_width": null,
1129
+ "object_fit": null,
1130
+ "object_position": null,
1131
+ "order": null,
1132
+ "overflow": null,
1133
+ "overflow_x": null,
1134
+ "overflow_y": null,
1135
+ "padding": null,
1136
+ "right": null,
1137
+ "top": null,
1138
+ "visibility": null,
1139
+ "width": null
1140
+ }
1141
+ },
1142
+ "80df5f3cd6c646808b09d99daed5bfd2": {
1143
+ "model_module": "@jupyter-widgets/controls",
1144
+ "model_name": "DescriptionStyleModel",
1145
+ "model_module_version": "1.5.0",
1146
+ "state": {
1147
+ "_model_module": "@jupyter-widgets/controls",
1148
+ "_model_module_version": "1.5.0",
1149
+ "_model_name": "DescriptionStyleModel",
1150
+ "_view_count": null,
1151
+ "_view_module": "@jupyter-widgets/base",
1152
+ "_view_module_version": "1.2.0",
1153
+ "_view_name": "StyleView",
1154
+ "description_width": ""
1155
+ }
1156
+ },
1157
+ "8458933373264dbeb58d0b5ace4fd9c6": {
1158
+ "model_module": "@jupyter-widgets/controls",
1159
+ "model_name": "HBoxModel",
1160
+ "model_module_version": "1.5.0",
1161
+ "state": {
1162
+ "_dom_classes": [],
1163
+ "_model_module": "@jupyter-widgets/controls",
1164
+ "_model_module_version": "1.5.0",
1165
+ "_model_name": "HBoxModel",
1166
+ "_view_count": null,
1167
+ "_view_module": "@jupyter-widgets/controls",
1168
+ "_view_module_version": "1.5.0",
1169
+ "_view_name": "HBoxView",
1170
+ "box_style": "",
1171
+ "children": [
1172
+ "IPY_MODEL_714009484da745dc8a87e5066b939de2",
1173
+ "IPY_MODEL_e43e970ce8ba477e83081a4c7fea05f5",
1174
+ "IPY_MODEL_7138aa9537fc4b4f809e57665be87139"
1175
+ ],
1176
+ "layout": "IPY_MODEL_46810cc7c7c54e31a65e609c386d86d9"
1177
+ }
1178
+ },
1179
+ "714009484da745dc8a87e5066b939de2": {
1180
+ "model_module": "@jupyter-widgets/controls",
1181
+ "model_name": "HTMLModel",
1182
+ "model_module_version": "1.5.0",
1183
+ "state": {
1184
+ "_dom_classes": [],
1185
+ "_model_module": "@jupyter-widgets/controls",
1186
+ "_model_module_version": "1.5.0",
1187
+ "_model_name": "HTMLModel",
1188
+ "_view_count": null,
1189
+ "_view_module": "@jupyter-widgets/controls",
1190
+ "_view_module_version": "1.5.0",
1191
+ "_view_name": "HTMLView",
1192
+ "description": "",
1193
+ "description_tooltip": null,
1194
+ "layout": "IPY_MODEL_cfed7deef0b74f4b9d160e9fdc2b138e",
1195
+ "placeholder": "​",
1196
+ "style": "IPY_MODEL_23ddab24ac304751b3babfaeec9360eb",
1197
+ "value": "Loading checkpoint shards: 100%"
1198
+ }
1199
+ },
1200
+ "e43e970ce8ba477e83081a4c7fea05f5": {
1201
+ "model_module": "@jupyter-widgets/controls",
1202
+ "model_name": "FloatProgressModel",
1203
+ "model_module_version": "1.5.0",
1204
+ "state": {
1205
+ "_dom_classes": [],
1206
+ "_model_module": "@jupyter-widgets/controls",
1207
+ "_model_module_version": "1.5.0",
1208
+ "_model_name": "FloatProgressModel",
1209
+ "_view_count": null,
1210
+ "_view_module": "@jupyter-widgets/controls",
1211
+ "_view_module_version": "1.5.0",
1212
+ "_view_name": "ProgressView",
1213
+ "bar_style": "success",
1214
+ "description": "",
1215
+ "description_tooltip": null,
1216
+ "layout": "IPY_MODEL_79e87175ffb949bd8cddf4577210a42d",
1217
+ "max": 2,
1218
+ "min": 0,
1219
+ "orientation": "horizontal",
1220
+ "style": "IPY_MODEL_5aed84a20ac34f2b943d26d66decc88f",
1221
+ "value": 2
1222
+ }
1223
+ },
1224
+ "7138aa9537fc4b4f809e57665be87139": {
1225
+ "model_module": "@jupyter-widgets/controls",
1226
+ "model_name": "HTMLModel",
1227
+ "model_module_version": "1.5.0",
1228
+ "state": {
1229
+ "_dom_classes": [],
1230
+ "_model_module": "@jupyter-widgets/controls",
1231
+ "_model_module_version": "1.5.0",
1232
+ "_model_name": "HTMLModel",
1233
+ "_view_count": null,
1234
+ "_view_module": "@jupyter-widgets/controls",
1235
+ "_view_module_version": "1.5.0",
1236
+ "_view_name": "HTMLView",
1237
+ "description": "",
1238
+ "description_tooltip": null,
1239
+ "layout": "IPY_MODEL_3ca0e1427ac6477c9921929af7ff00d1",
1240
+ "placeholder": "​",
1241
+ "style": "IPY_MODEL_a9a5503caf384b93bf987e5271a577d2",
1242
+ "value": " 2/2 [00:00&lt;00:00,  2.83it/s]"
1243
+ }
1244
+ },
1245
+ "46810cc7c7c54e31a65e609c386d86d9": {
1246
+ "model_module": "@jupyter-widgets/base",
1247
+ "model_name": "LayoutModel",
1248
+ "model_module_version": "1.2.0",
1249
+ "state": {
1250
+ "_model_module": "@jupyter-widgets/base",
1251
+ "_model_module_version": "1.2.0",
1252
+ "_model_name": "LayoutModel",
1253
+ "_view_count": null,
1254
+ "_view_module": "@jupyter-widgets/base",
1255
+ "_view_module_version": "1.2.0",
1256
+ "_view_name": "LayoutView",
1257
+ "align_content": null,
1258
+ "align_items": null,
1259
+ "align_self": null,
1260
+ "border": null,
1261
+ "bottom": null,
1262
+ "display": null,
1263
+ "flex": null,
1264
+ "flex_flow": null,
1265
+ "grid_area": null,
1266
+ "grid_auto_columns": null,
1267
+ "grid_auto_flow": null,
1268
+ "grid_auto_rows": null,
1269
+ "grid_column": null,
1270
+ "grid_gap": null,
1271
+ "grid_row": null,
1272
+ "grid_template_areas": null,
1273
+ "grid_template_columns": null,
1274
+ "grid_template_rows": null,
1275
+ "height": null,
1276
+ "justify_content": null,
1277
+ "justify_items": null,
1278
+ "left": null,
1279
+ "margin": null,
1280
+ "max_height": null,
1281
+ "max_width": null,
1282
+ "min_height": null,
1283
+ "min_width": null,
1284
+ "object_fit": null,
1285
+ "object_position": null,
1286
+ "order": null,
1287
+ "overflow": null,
1288
+ "overflow_x": null,
1289
+ "overflow_y": null,
1290
+ "padding": null,
1291
+ "right": null,
1292
+ "top": null,
1293
+ "visibility": null,
1294
+ "width": null
1295
+ }
1296
+ },
1297
+ "cfed7deef0b74f4b9d160e9fdc2b138e": {
1298
+ "model_module": "@jupyter-widgets/base",
1299
+ "model_name": "LayoutModel",
1300
+ "model_module_version": "1.2.0",
1301
+ "state": {
1302
+ "_model_module": "@jupyter-widgets/base",
1303
+ "_model_module_version": "1.2.0",
1304
+ "_model_name": "LayoutModel",
1305
+ "_view_count": null,
1306
+ "_view_module": "@jupyter-widgets/base",
1307
+ "_view_module_version": "1.2.0",
1308
+ "_view_name": "LayoutView",
1309
+ "align_content": null,
1310
+ "align_items": null,
1311
+ "align_self": null,
1312
+ "border": null,
1313
+ "bottom": null,
1314
+ "display": null,
1315
+ "flex": null,
1316
+ "flex_flow": null,
1317
+ "grid_area": null,
1318
+ "grid_auto_columns": null,
1319
+ "grid_auto_flow": null,
1320
+ "grid_auto_rows": null,
1321
+ "grid_column": null,
1322
+ "grid_gap": null,
1323
+ "grid_row": null,
1324
+ "grid_template_areas": null,
1325
+ "grid_template_columns": null,
1326
+ "grid_template_rows": null,
1327
+ "height": null,
1328
+ "justify_content": null,
1329
+ "justify_items": null,
1330
+ "left": null,
1331
+ "margin": null,
1332
+ "max_height": null,
1333
+ "max_width": null,
1334
+ "min_height": null,
1335
+ "min_width": null,
1336
+ "object_fit": null,
1337
+ "object_position": null,
1338
+ "order": null,
1339
+ "overflow": null,
1340
+ "overflow_x": null,
1341
+ "overflow_y": null,
1342
+ "padding": null,
1343
+ "right": null,
1344
+ "top": null,
1345
+ "visibility": null,
1346
+ "width": null
1347
+ }
1348
+ },
1349
+ "23ddab24ac304751b3babfaeec9360eb": {
1350
+ "model_module": "@jupyter-widgets/controls",
1351
+ "model_name": "DescriptionStyleModel",
1352
+ "model_module_version": "1.5.0",
1353
+ "state": {
1354
+ "_model_module": "@jupyter-widgets/controls",
1355
+ "_model_module_version": "1.5.0",
1356
+ "_model_name": "DescriptionStyleModel",
1357
+ "_view_count": null,
1358
+ "_view_module": "@jupyter-widgets/base",
1359
+ "_view_module_version": "1.2.0",
1360
+ "_view_name": "StyleView",
1361
+ "description_width": ""
1362
+ }
1363
+ },
1364
+ "79e87175ffb949bd8cddf4577210a42d": {
1365
+ "model_module": "@jupyter-widgets/base",
1366
+ "model_name": "LayoutModel",
1367
+ "model_module_version": "1.2.0",
1368
+ "state": {
1369
+ "_model_module": "@jupyter-widgets/base",
1370
+ "_model_module_version": "1.2.0",
1371
+ "_model_name": "LayoutModel",
1372
+ "_view_count": null,
1373
+ "_view_module": "@jupyter-widgets/base",
1374
+ "_view_module_version": "1.2.0",
1375
+ "_view_name": "LayoutView",
1376
+ "align_content": null,
1377
+ "align_items": null,
1378
+ "align_self": null,
1379
+ "border": null,
1380
+ "bottom": null,
1381
+ "display": null,
1382
+ "flex": null,
1383
+ "flex_flow": null,
1384
+ "grid_area": null,
1385
+ "grid_auto_columns": null,
1386
+ "grid_auto_flow": null,
1387
+ "grid_auto_rows": null,
1388
+ "grid_column": null,
1389
+ "grid_gap": null,
1390
+ "grid_row": null,
1391
+ "grid_template_areas": null,
1392
+ "grid_template_columns": null,
1393
+ "grid_template_rows": null,
1394
+ "height": null,
1395
+ "justify_content": null,
1396
+ "justify_items": null,
1397
+ "left": null,
1398
+ "margin": null,
1399
+ "max_height": null,
1400
+ "max_width": null,
1401
+ "min_height": null,
1402
+ "min_width": null,
1403
+ "object_fit": null,
1404
+ "object_position": null,
1405
+ "order": null,
1406
+ "overflow": null,
1407
+ "overflow_x": null,
1408
+ "overflow_y": null,
1409
+ "padding": null,
1410
+ "right": null,
1411
+ "top": null,
1412
+ "visibility": null,
1413
+ "width": null
1414
+ }
1415
+ },
1416
+ "5aed84a20ac34f2b943d26d66decc88f": {
1417
+ "model_module": "@jupyter-widgets/controls",
1418
+ "model_name": "ProgressStyleModel",
1419
+ "model_module_version": "1.5.0",
1420
+ "state": {
1421
+ "_model_module": "@jupyter-widgets/controls",
1422
+ "_model_module_version": "1.5.0",
1423
+ "_model_name": "ProgressStyleModel",
1424
+ "_view_count": null,
1425
+ "_view_module": "@jupyter-widgets/base",
1426
+ "_view_module_version": "1.2.0",
1427
+ "_view_name": "StyleView",
1428
+ "bar_color": null,
1429
+ "description_width": ""
1430
+ }
1431
+ },
1432
+ "3ca0e1427ac6477c9921929af7ff00d1": {
1433
+ "model_module": "@jupyter-widgets/base",
1434
+ "model_name": "LayoutModel",
1435
+ "model_module_version": "1.2.0",
1436
+ "state": {
1437
+ "_model_module": "@jupyter-widgets/base",
1438
+ "_model_module_version": "1.2.0",
1439
+ "_model_name": "LayoutModel",
1440
+ "_view_count": null,
1441
+ "_view_module": "@jupyter-widgets/base",
1442
+ "_view_module_version": "1.2.0",
1443
+ "_view_name": "LayoutView",
1444
+ "align_content": null,
1445
+ "align_items": null,
1446
+ "align_self": null,
1447
+ "border": null,
1448
+ "bottom": null,
1449
+ "display": null,
1450
+ "flex": null,
1451
+ "flex_flow": null,
1452
+ "grid_area": null,
1453
+ "grid_auto_columns": null,
1454
+ "grid_auto_flow": null,
1455
+ "grid_auto_rows": null,
1456
+ "grid_column": null,
1457
+ "grid_gap": null,
1458
+ "grid_row": null,
1459
+ "grid_template_areas": null,
1460
+ "grid_template_columns": null,
1461
+ "grid_template_rows": null,
1462
+ "height": null,
1463
+ "justify_content": null,
1464
+ "justify_items": null,
1465
+ "left": null,
1466
+ "margin": null,
1467
+ "max_height": null,
1468
+ "max_width": null,
1469
+ "min_height": null,
1470
+ "min_width": null,
1471
+ "object_fit": null,
1472
+ "object_position": null,
1473
+ "order": null,
1474
+ "overflow": null,
1475
+ "overflow_x": null,
1476
+ "overflow_y": null,
1477
+ "padding": null,
1478
+ "right": null,
1479
+ "top": null,
1480
+ "visibility": null,
1481
+ "width": null
1482
+ }
1483
+ },
1484
+ "a9a5503caf384b93bf987e5271a577d2": {
1485
+ "model_module": "@jupyter-widgets/controls",
1486
+ "model_name": "DescriptionStyleModel",
1487
+ "model_module_version": "1.5.0",
1488
+ "state": {
1489
+ "_model_module": "@jupyter-widgets/controls",
1490
+ "_model_module_version": "1.5.0",
1491
+ "_model_name": "DescriptionStyleModel",
1492
+ "_view_count": null,
1493
+ "_view_module": "@jupyter-widgets/base",
1494
+ "_view_module_version": "1.2.0",
1495
+ "_view_name": "StyleView",
1496
+ "description_width": ""
1497
+ }
1498
+ },
1499
+ "c68f0fe7a6bb4060afcb05e3f6422288": {
1500
+ "model_module": "@jupyter-widgets/controls",
1501
+ "model_name": "HBoxModel",
1502
+ "model_module_version": "1.5.0",
1503
+ "state": {
1504
+ "_dom_classes": [],
1505
+ "_model_module": "@jupyter-widgets/controls",
1506
+ "_model_module_version": "1.5.0",
1507
+ "_model_name": "HBoxModel",
1508
+ "_view_count": null,
1509
+ "_view_module": "@jupyter-widgets/controls",
1510
+ "_view_module_version": "1.5.0",
1511
+ "_view_name": "HBoxView",
1512
+ "box_style": "",
1513
+ "children": [
1514
+ "IPY_MODEL_fef3c94897fc4ffa86f91aac7a45ac7f",
1515
+ "IPY_MODEL_92881d2e3f1a438b92a389cc6022f7ad",
1516
+ "IPY_MODEL_f518ab021bc648f188638fd168879edd"
1517
+ ],
1518
+ "layout": "IPY_MODEL_1a29c71234d74f08b2645f9383fee126"
1519
+ }
1520
+ },
1521
+ "fef3c94897fc4ffa86f91aac7a45ac7f": {
1522
+ "model_module": "@jupyter-widgets/controls",
1523
+ "model_name": "HTMLModel",
1524
+ "model_module_version": "1.5.0",
1525
+ "state": {
1526
+ "_dom_classes": [],
1527
+ "_model_module": "@jupyter-widgets/controls",
1528
+ "_model_module_version": "1.5.0",
1529
+ "_model_name": "HTMLModel",
1530
+ "_view_count": null,
1531
+ "_view_module": "@jupyter-widgets/controls",
1532
+ "_view_module_version": "1.5.0",
1533
+ "_view_name": "HTMLView",
1534
+ "description": "",
1535
+ "description_tooltip": null,
1536
+ "layout": "IPY_MODEL_f8553ec713ea440eb0208a1012547988",
1537
+ "placeholder": "​",
1538
+ "style": "IPY_MODEL_25e0373512b747ba8ebe020b8b8ab932",
1539
+ "value": "Loading checkpoint shards: 100%"
1540
+ }
1541
+ },
1542
+ "92881d2e3f1a438b92a389cc6022f7ad": {
1543
+ "model_module": "@jupyter-widgets/controls",
1544
+ "model_name": "FloatProgressModel",
1545
+ "model_module_version": "1.5.0",
1546
+ "state": {
1547
+ "_dom_classes": [],
1548
+ "_model_module": "@jupyter-widgets/controls",
1549
+ "_model_module_version": "1.5.0",
1550
+ "_model_name": "FloatProgressModel",
1551
+ "_view_count": null,
1552
+ "_view_module": "@jupyter-widgets/controls",
1553
+ "_view_module_version": "1.5.0",
1554
+ "_view_name": "ProgressView",
1555
+ "bar_style": "success",
1556
+ "description": "",
1557
+ "description_tooltip": null,
1558
+ "layout": "IPY_MODEL_daff4ba27c68441395aa5377111f30f1",
1559
+ "max": 2,
1560
+ "min": 0,
1561
+ "orientation": "horizontal",
1562
+ "style": "IPY_MODEL_863090b3318e4e0186bd46d3d1479de4",
1563
+ "value": 2
1564
+ }
1565
+ },
1566
+ "f518ab021bc648f188638fd168879edd": {
1567
+ "model_module": "@jupyter-widgets/controls",
1568
+ "model_name": "HTMLModel",
1569
+ "model_module_version": "1.5.0",
1570
+ "state": {
1571
+ "_dom_classes": [],
1572
+ "_model_module": "@jupyter-widgets/controls",
1573
+ "_model_module_version": "1.5.0",
1574
+ "_model_name": "HTMLModel",
1575
+ "_view_count": null,
1576
+ "_view_module": "@jupyter-widgets/controls",
1577
+ "_view_module_version": "1.5.0",
1578
+ "_view_name": "HTMLView",
1579
+ "description": "",
1580
+ "description_tooltip": null,
1581
+ "layout": "IPY_MODEL_acae1751ff5d4293bb588c2d9c7ab851",
1582
+ "placeholder": "​",
1583
+ "style": "IPY_MODEL_8859eb8d9c154cb79a302db1568768fa",
1584
+ "value": " 2/2 [00:05&lt;00:00,  2.39s/it]"
1585
+ }
1586
+ },
1587
+ "1a29c71234d74f08b2645f9383fee126": {
1588
+ "model_module": "@jupyter-widgets/base",
1589
+ "model_name": "LayoutModel",
1590
+ "model_module_version": "1.2.0",
1591
+ "state": {
1592
+ "_model_module": "@jupyter-widgets/base",
1593
+ "_model_module_version": "1.2.0",
1594
+ "_model_name": "LayoutModel",
1595
+ "_view_count": null,
1596
+ "_view_module": "@jupyter-widgets/base",
1597
+ "_view_module_version": "1.2.0",
1598
+ "_view_name": "LayoutView",
1599
+ "align_content": null,
1600
+ "align_items": null,
1601
+ "align_self": null,
1602
+ "border": null,
1603
+ "bottom": null,
1604
+ "display": null,
1605
+ "flex": null,
1606
+ "flex_flow": null,
1607
+ "grid_area": null,
1608
+ "grid_auto_columns": null,
1609
+ "grid_auto_flow": null,
1610
+ "grid_auto_rows": null,
1611
+ "grid_column": null,
1612
+ "grid_gap": null,
1613
+ "grid_row": null,
1614
+ "grid_template_areas": null,
1615
+ "grid_template_columns": null,
1616
+ "grid_template_rows": null,
1617
+ "height": null,
1618
+ "justify_content": null,
1619
+ "justify_items": null,
1620
+ "left": null,
1621
+ "margin": null,
1622
+ "max_height": null,
1623
+ "max_width": null,
1624
+ "min_height": null,
1625
+ "min_width": null,
1626
+ "object_fit": null,
1627
+ "object_position": null,
1628
+ "order": null,
1629
+ "overflow": null,
1630
+ "overflow_x": null,
1631
+ "overflow_y": null,
1632
+ "padding": null,
1633
+ "right": null,
1634
+ "top": null,
1635
+ "visibility": null,
1636
+ "width": null
1637
+ }
1638
+ },
1639
+ "f8553ec713ea440eb0208a1012547988": {
1640
+ "model_module": "@jupyter-widgets/base",
1641
+ "model_name": "LayoutModel",
1642
+ "model_module_version": "1.2.0",
1643
+ "state": {
1644
+ "_model_module": "@jupyter-widgets/base",
1645
+ "_model_module_version": "1.2.0",
1646
+ "_model_name": "LayoutModel",
1647
+ "_view_count": null,
1648
+ "_view_module": "@jupyter-widgets/base",
1649
+ "_view_module_version": "1.2.0",
1650
+ "_view_name": "LayoutView",
1651
+ "align_content": null,
1652
+ "align_items": null,
1653
+ "align_self": null,
1654
+ "border": null,
1655
+ "bottom": null,
1656
+ "display": null,
1657
+ "flex": null,
1658
+ "flex_flow": null,
1659
+ "grid_area": null,
1660
+ "grid_auto_columns": null,
1661
+ "grid_auto_flow": null,
1662
+ "grid_auto_rows": null,
1663
+ "grid_column": null,
1664
+ "grid_gap": null,
1665
+ "grid_row": null,
1666
+ "grid_template_areas": null,
1667
+ "grid_template_columns": null,
1668
+ "grid_template_rows": null,
1669
+ "height": null,
1670
+ "justify_content": null,
1671
+ "justify_items": null,
1672
+ "left": null,
1673
+ "margin": null,
1674
+ "max_height": null,
1675
+ "max_width": null,
1676
+ "min_height": null,
1677
+ "min_width": null,
1678
+ "object_fit": null,
1679
+ "object_position": null,
1680
+ "order": null,
1681
+ "overflow": null,
1682
+ "overflow_x": null,
1683
+ "overflow_y": null,
1684
+ "padding": null,
1685
+ "right": null,
1686
+ "top": null,
1687
+ "visibility": null,
1688
+ "width": null
1689
+ }
1690
+ },
1691
+ "25e0373512b747ba8ebe020b8b8ab932": {
1692
+ "model_module": "@jupyter-widgets/controls",
1693
+ "model_name": "DescriptionStyleModel",
1694
+ "model_module_version": "1.5.0",
1695
+ "state": {
1696
+ "_model_module": "@jupyter-widgets/controls",
1697
+ "_model_module_version": "1.5.0",
1698
+ "_model_name": "DescriptionStyleModel",
1699
+ "_view_count": null,
1700
+ "_view_module": "@jupyter-widgets/base",
1701
+ "_view_module_version": "1.2.0",
1702
+ "_view_name": "StyleView",
1703
+ "description_width": ""
1704
+ }
1705
+ },
1706
+ "daff4ba27c68441395aa5377111f30f1": {
1707
+ "model_module": "@jupyter-widgets/base",
1708
+ "model_name": "LayoutModel",
1709
+ "model_module_version": "1.2.0",
1710
+ "state": {
1711
+ "_model_module": "@jupyter-widgets/base",
1712
+ "_model_module_version": "1.2.0",
1713
+ "_model_name": "LayoutModel",
1714
+ "_view_count": null,
1715
+ "_view_module": "@jupyter-widgets/base",
1716
+ "_view_module_version": "1.2.0",
1717
+ "_view_name": "LayoutView",
1718
+ "align_content": null,
1719
+ "align_items": null,
1720
+ "align_self": null,
1721
+ "border": null,
1722
+ "bottom": null,
1723
+ "display": null,
1724
+ "flex": null,
1725
+ "flex_flow": null,
1726
+ "grid_area": null,
1727
+ "grid_auto_columns": null,
1728
+ "grid_auto_flow": null,
1729
+ "grid_auto_rows": null,
1730
+ "grid_column": null,
1731
+ "grid_gap": null,
1732
+ "grid_row": null,
1733
+ "grid_template_areas": null,
1734
+ "grid_template_columns": null,
1735
+ "grid_template_rows": null,
1736
+ "height": null,
1737
+ "justify_content": null,
1738
+ "justify_items": null,
1739
+ "left": null,
1740
+ "margin": null,
1741
+ "max_height": null,
1742
+ "max_width": null,
1743
+ "min_height": null,
1744
+ "min_width": null,
1745
+ "object_fit": null,
1746
+ "object_position": null,
1747
+ "order": null,
1748
+ "overflow": null,
1749
+ "overflow_x": null,
1750
+ "overflow_y": null,
1751
+ "padding": null,
1752
+ "right": null,
1753
+ "top": null,
1754
+ "visibility": null,
1755
+ "width": null
1756
+ }
1757
+ },
1758
+ "863090b3318e4e0186bd46d3d1479de4": {
1759
+ "model_module": "@jupyter-widgets/controls",
1760
+ "model_name": "ProgressStyleModel",
1761
+ "model_module_version": "1.5.0",
1762
+ "state": {
1763
+ "_model_module": "@jupyter-widgets/controls",
1764
+ "_model_module_version": "1.5.0",
1765
+ "_model_name": "ProgressStyleModel",
1766
+ "_view_count": null,
1767
+ "_view_module": "@jupyter-widgets/base",
1768
+ "_view_module_version": "1.2.0",
1769
+ "_view_name": "StyleView",
1770
+ "bar_color": null,
1771
+ "description_width": ""
1772
+ }
1773
+ },
1774
+ "acae1751ff5d4293bb588c2d9c7ab851": {
1775
+ "model_module": "@jupyter-widgets/base",
1776
+ "model_name": "LayoutModel",
1777
+ "model_module_version": "1.2.0",
1778
+ "state": {
1779
+ "_model_module": "@jupyter-widgets/base",
1780
+ "_model_module_version": "1.2.0",
1781
+ "_model_name": "LayoutModel",
1782
+ "_view_count": null,
1783
+ "_view_module": "@jupyter-widgets/base",
1784
+ "_view_module_version": "1.2.0",
1785
+ "_view_name": "LayoutView",
1786
+ "align_content": null,
1787
+ "align_items": null,
1788
+ "align_self": null,
1789
+ "border": null,
1790
+ "bottom": null,
1791
+ "display": null,
1792
+ "flex": null,
1793
+ "flex_flow": null,
1794
+ "grid_area": null,
1795
+ "grid_auto_columns": null,
1796
+ "grid_auto_flow": null,
1797
+ "grid_auto_rows": null,
1798
+ "grid_column": null,
1799
+ "grid_gap": null,
1800
+ "grid_row": null,
1801
+ "grid_template_areas": null,
1802
+ "grid_template_columns": null,
1803
+ "grid_template_rows": null,
1804
+ "height": null,
1805
+ "justify_content": null,
1806
+ "justify_items": null,
1807
+ "left": null,
1808
+ "margin": null,
1809
+ "max_height": null,
1810
+ "max_width": null,
1811
+ "min_height": null,
1812
+ "min_width": null,
1813
+ "object_fit": null,
1814
+ "object_position": null,
1815
+ "order": null,
1816
+ "overflow": null,
1817
+ "overflow_x": null,
1818
+ "overflow_y": null,
1819
+ "padding": null,
1820
+ "right": null,
1821
+ "top": null,
1822
+ "visibility": null,
1823
+ "width": null
1824
+ }
1825
+ },
1826
+ "8859eb8d9c154cb79a302db1568768fa": {
1827
+ "model_module": "@jupyter-widgets/controls",
1828
+ "model_name": "DescriptionStyleModel",
1829
+ "model_module_version": "1.5.0",
1830
+ "state": {
1831
+ "_model_module": "@jupyter-widgets/controls",
1832
+ "_model_module_version": "1.5.0",
1833
+ "_model_name": "DescriptionStyleModel",
1834
+ "_view_count": null,
1835
+ "_view_module": "@jupyter-widgets/base",
1836
+ "_view_module_version": "1.2.0",
1837
+ "_view_name": "StyleView",
1838
+ "description_width": ""
1839
+ }
1840
+ }
1841
+ }
1842
+ }
1843
+ },
1844
+ "nbformat": 4,
1845
+ "nbformat_minor": 0
1846
+ }
Fine_tune_SmolVLM2_on_Video.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
Finetune_ColPali.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
Fit_in_vision_models_using_quanto.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
Gemma_3_for_Video_Understanding.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
Gemma_3n_Video_Vibe_Tests.ipynb ADDED
@@ -0,0 +1,1489 @@
1
+ {
2
+ "nbformat": 4,
3
+ "nbformat_minor": 0,
4
+ "metadata": {
5
+ "colab": {
6
+ "provenance": [],
7
+ "machine_shape": "hm",
8
+ "gpuType": "A100",
9
+ "include_colab_link": true
10
+ },
11
+ "kernelspec": {
12
+ "name": "python3",
13
+ "display_name": "Python 3"
14
+ },
15
+ "language_info": {
16
+ "name": "python"
17
+ },
18
+ "accelerator": "GPU",
19
+ "widgets": {
20
+ "application/vnd.jupyter.widget-state+json": {
21
+ "542490f74e974451bc44009a6fa174bd": {
22
+ "model_module": "@jupyter-widgets/controls",
23
+ "model_name": "VBoxModel",
24
+ "model_module_version": "1.5.0",
25
+ "state": {
26
+ "_dom_classes": [],
27
+ "_model_module": "@jupyter-widgets/controls",
28
+ "_model_module_version": "1.5.0",
29
+ "_model_name": "VBoxModel",
30
+ "_view_count": null,
31
+ "_view_module": "@jupyter-widgets/controls",
32
+ "_view_module_version": "1.5.0",
33
+ "_view_name": "VBoxView",
34
+ "box_style": "",
35
+ "children": [],
36
+ "layout": "IPY_MODEL_8d0e5abdd7c549f1a66ee198c9fa1430"
37
+ }
38
+ },
39
+ "409f985be1134b468b81136fbdb54408": {
40
+ "model_module": "@jupyter-widgets/controls",
41
+ "model_name": "HTMLModel",
42
+ "model_module_version": "1.5.0",
43
+ "state": {
44
+ "_dom_classes": [],
45
+ "_model_module": "@jupyter-widgets/controls",
46
+ "_model_module_version": "1.5.0",
47
+ "_model_name": "HTMLModel",
48
+ "_view_count": null,
49
+ "_view_module": "@jupyter-widgets/controls",
50
+ "_view_module_version": "1.5.0",
51
+ "_view_name": "HTMLView",
52
+ "description": "",
53
+ "description_tooltip": null,
54
+ "layout": "IPY_MODEL_c72dd3d6a4c246cfa6590c314783c8f0",
55
+ "placeholder": "​",
56
+ "style": "IPY_MODEL_c0e471e664dd41eab98efe08301ef5e1",
57
+ "value": "<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.svg\nalt='Hugging Face'> <br> Copy a token from <a\nhref=\"https://huggingface.co/settings/tokens\" target=\"_blank\">your Hugging Face\ntokens page</a> and paste it below. <br> Immediately click login after copying\nyour token or it might be stored in plain text in this notebook file. </center>"
58
+ }
59
+ },
60
+ "57cb1e931c614980a4147cb125524d7d": {
61
+ "model_module": "@jupyter-widgets/controls",
62
+ "model_name": "PasswordModel",
63
+ "model_module_version": "1.5.0",
64
+ "state": {
65
+ "_dom_classes": [],
66
+ "_model_module": "@jupyter-widgets/controls",
67
+ "_model_module_version": "1.5.0",
68
+ "_model_name": "PasswordModel",
69
+ "_view_count": null,
70
+ "_view_module": "@jupyter-widgets/controls",
71
+ "_view_module_version": "1.5.0",
72
+ "_view_name": "PasswordView",
73
+ "continuous_update": true,
74
+ "description": "Token:",
75
+ "description_tooltip": null,
76
+ "disabled": false,
77
+ "layout": "IPY_MODEL_868f63ea9455442d837dc2c422918800",
78
+ "placeholder": "​",
79
+ "style": "IPY_MODEL_5b7b4707b1bf4159a10bf7e289bde435",
80
+ "value": ""
81
+ }
82
+ },
83
+ "87dc7aaf52e349a7bb43bb1b8bc137ee": {
84
+ "model_module": "@jupyter-widgets/controls",
85
+ "model_name": "CheckboxModel",
86
+ "model_module_version": "1.5.0",
87
+ "state": {
88
+ "_dom_classes": [],
89
+ "_model_module": "@jupyter-widgets/controls",
90
+ "_model_module_version": "1.5.0",
91
+ "_model_name": "CheckboxModel",
92
+ "_view_count": null,
93
+ "_view_module": "@jupyter-widgets/controls",
94
+ "_view_module_version": "1.5.0",
95
+ "_view_name": "CheckboxView",
96
+ "description": "Add token as git credential?",
97
+ "description_tooltip": null,
98
+ "disabled": false,
99
+ "indent": true,
100
+ "layout": "IPY_MODEL_889d0d1ed24e4de2b89896511d008e60",
101
+ "style": "IPY_MODEL_68fc757825dd44a48ab2383db20958db",
102
+ "value": true
103
+ }
104
+ },
105
+ "983ed4cb4eea42daa9ae8c0417021a21": {
106
+ "model_module": "@jupyter-widgets/controls",
107
+ "model_name": "ButtonModel",
108
+ "model_module_version": "1.5.0",
109
+ "state": {
110
+ "_dom_classes": [],
111
+ "_model_module": "@jupyter-widgets/controls",
112
+ "_model_module_version": "1.5.0",
113
+ "_model_name": "ButtonModel",
114
+ "_view_count": null,
115
+ "_view_module": "@jupyter-widgets/controls",
116
+ "_view_module_version": "1.5.0",
117
+ "_view_name": "ButtonView",
118
+ "button_style": "",
119
+ "description": "Login",
120
+ "disabled": false,
121
+ "icon": "",
122
+ "layout": "IPY_MODEL_cb76f933e6e640d9a688f7838e5fb0b3",
123
+ "style": "IPY_MODEL_8704264bff4d46c9813ac9acf92da962",
124
+ "tooltip": ""
125
+ }
126
+ },
127
+ "40c381fd7bb04b43a879044a4e988cc6": {
128
+ "model_module": "@jupyter-widgets/controls",
129
+ "model_name": "HTMLModel",
130
+ "model_module_version": "1.5.0",
131
+ "state": {
132
+ "_dom_classes": [],
133
+ "_model_module": "@jupyter-widgets/controls",
134
+ "_model_module_version": "1.5.0",
135
+ "_model_name": "HTMLModel",
136
+ "_view_count": null,
137
+ "_view_module": "@jupyter-widgets/controls",
138
+ "_view_module_version": "1.5.0",
139
+ "_view_name": "HTMLView",
140
+ "description": "",
141
+ "description_tooltip": null,
142
+ "layout": "IPY_MODEL_9b5d87960dde401baeaf8b6144fb8bad",
143
+ "placeholder": "​",
144
+ "style": "IPY_MODEL_76e06881e5e94197a24944e07fdf3189",
145
+ "value": "\n<b>Pro Tip:</b> If you don't already have one, you can create a dedicated\n'notebooks' token with 'write' access, that you can then easily reuse for all\nnotebooks. </center>"
146
+ }
147
+ },
148
+ "8d0e5abdd7c549f1a66ee198c9fa1430": {
149
+ "model_module": "@jupyter-widgets/base",
150
+ "model_name": "LayoutModel",
151
+ "model_module_version": "1.2.0",
152
+ "state": {
153
+ "_model_module": "@jupyter-widgets/base",
154
+ "_model_module_version": "1.2.0",
155
+ "_model_name": "LayoutModel",
156
+ "_view_count": null,
157
+ "_view_module": "@jupyter-widgets/base",
158
+ "_view_module_version": "1.2.0",
159
+ "_view_name": "LayoutView",
160
+ "align_content": null,
161
+ "align_items": "center",
162
+ "align_self": null,
163
+ "border": null,
164
+ "bottom": null,
165
+ "display": "flex",
166
+ "flex": null,
167
+ "flex_flow": "column",
168
+ "grid_area": null,
169
+ "grid_auto_columns": null,
170
+ "grid_auto_flow": null,
171
+ "grid_auto_rows": null,
172
+ "grid_column": null,
173
+ "grid_gap": null,
174
+ "grid_row": null,
175
+ "grid_template_areas": null,
176
+ "grid_template_columns": null,
177
+ "grid_template_rows": null,
178
+ "height": null,
179
+ "justify_content": null,
180
+ "justify_items": null,
181
+ "left": null,
182
+ "margin": null,
183
+ "max_height": null,
184
+ "max_width": null,
185
+ "min_height": null,
186
+ "min_width": null,
187
+ "object_fit": null,
188
+ "object_position": null,
189
+ "order": null,
190
+ "overflow": null,
191
+ "overflow_x": null,
192
+ "overflow_y": null,
193
+ "padding": null,
194
+ "right": null,
195
+ "top": null,
196
+ "visibility": null,
197
+ "width": "50%"
198
+ }
199
+ },
200
+ "c72dd3d6a4c246cfa6590c314783c8f0": {
201
+ "model_module": "@jupyter-widgets/base",
202
+ "model_name": "LayoutModel",
203
+ "model_module_version": "1.2.0",
204
+ "state": {
205
+ "_model_module": "@jupyter-widgets/base",
206
+ "_model_module_version": "1.2.0",
207
+ "_model_name": "LayoutModel",
208
+ "_view_count": null,
209
+ "_view_module": "@jupyter-widgets/base",
210
+ "_view_module_version": "1.2.0",
211
+ "_view_name": "LayoutView",
212
+ "align_content": null,
213
+ "align_items": null,
214
+ "align_self": null,
215
+ "border": null,
216
+ "bottom": null,
217
+ "display": null,
218
+ "flex": null,
219
+ "flex_flow": null,
220
+ "grid_area": null,
221
+ "grid_auto_columns": null,
222
+ "grid_auto_flow": null,
223
+ "grid_auto_rows": null,
224
+ "grid_column": null,
225
+ "grid_gap": null,
226
+ "grid_row": null,
227
+ "grid_template_areas": null,
228
+ "grid_template_columns": null,
229
+ "grid_template_rows": null,
230
+ "height": null,
231
+ "justify_content": null,
232
+ "justify_items": null,
233
+ "left": null,
234
+ "margin": null,
235
+ "max_height": null,
236
+ "max_width": null,
237
+ "min_height": null,
238
+ "min_width": null,
239
+ "object_fit": null,
240
+ "object_position": null,
241
+ "order": null,
242
+ "overflow": null,
243
+ "overflow_x": null,
244
+ "overflow_y": null,
245
+ "padding": null,
246
+ "right": null,
247
+ "top": null,
248
+ "visibility": null,
249
+ "width": null
250
+ }
251
+ },
252
+ "c0e471e664dd41eab98efe08301ef5e1": {
253
+ "model_module": "@jupyter-widgets/controls",
254
+ "model_name": "DescriptionStyleModel",
255
+ "model_module_version": "1.5.0",
256
+ "state": {
257
+ "_model_module": "@jupyter-widgets/controls",
258
+ "_model_module_version": "1.5.0",
259
+ "_model_name": "DescriptionStyleModel",
260
+ "_view_count": null,
261
+ "_view_module": "@jupyter-widgets/base",
262
+ "_view_module_version": "1.2.0",
263
+ "_view_name": "StyleView",
264
+ "description_width": ""
265
+ }
266
+ },
267
+ "868f63ea9455442d837dc2c422918800": {
268
+ "model_module": "@jupyter-widgets/base",
269
+ "model_name": "LayoutModel",
270
+ "model_module_version": "1.2.0",
271
+ "state": {
272
+ "_model_module": "@jupyter-widgets/base",
273
+ "_model_module_version": "1.2.0",
274
+ "_model_name": "LayoutModel",
275
+ "_view_count": null,
276
+ "_view_module": "@jupyter-widgets/base",
277
+ "_view_module_version": "1.2.0",
278
+ "_view_name": "LayoutView",
279
+ "align_content": null,
280
+ "align_items": null,
281
+ "align_self": null,
282
+ "border": null,
283
+ "bottom": null,
284
+ "display": null,
285
+ "flex": null,
286
+ "flex_flow": null,
287
+ "grid_area": null,
288
+ "grid_auto_columns": null,
289
+ "grid_auto_flow": null,
290
+ "grid_auto_rows": null,
291
+ "grid_column": null,
292
+ "grid_gap": null,
293
+ "grid_row": null,
294
+ "grid_template_areas": null,
295
+ "grid_template_columns": null,
296
+ "grid_template_rows": null,
297
+ "height": null,
298
+ "justify_content": null,
299
+ "justify_items": null,
300
+ "left": null,
301
+ "margin": null,
302
+ "max_height": null,
303
+ "max_width": null,
304
+ "min_height": null,
305
+ "min_width": null,
306
+ "object_fit": null,
307
+ "object_position": null,
308
+ "order": null,
309
+ "overflow": null,
310
+ "overflow_x": null,
311
+ "overflow_y": null,
312
+ "padding": null,
313
+ "right": null,
314
+ "top": null,
315
+ "visibility": null,
316
+ "width": null
317
+ }
318
+ },
319
+ "5b7b4707b1bf4159a10bf7e289bde435": {
320
+ "model_module": "@jupyter-widgets/controls",
321
+ "model_name": "DescriptionStyleModel",
322
+ "model_module_version": "1.5.0",
323
+ "state": {
324
+ "_model_module": "@jupyter-widgets/controls",
325
+ "_model_module_version": "1.5.0",
326
+ "_model_name": "DescriptionStyleModel",
327
+ "_view_count": null,
328
+ "_view_module": "@jupyter-widgets/base",
329
+ "_view_module_version": "1.2.0",
330
+ "_view_name": "StyleView",
331
+ "description_width": ""
332
+ }
333
+ },
334
+ "889d0d1ed24e4de2b89896511d008e60": {
335
+ "model_module": "@jupyter-widgets/base",
336
+ "model_name": "LayoutModel",
337
+ "model_module_version": "1.2.0",
338
+ "state": {
339
+ "_model_module": "@jupyter-widgets/base",
340
+ "_model_module_version": "1.2.0",
341
+ "_model_name": "LayoutModel",
342
+ "_view_count": null,
343
+ "_view_module": "@jupyter-widgets/base",
344
+ "_view_module_version": "1.2.0",
345
+ "_view_name": "LayoutView",
346
+ "align_content": null,
347
+ "align_items": null,
348
+ "align_self": null,
349
+ "border": null,
350
+ "bottom": null,
351
+ "display": null,
352
+ "flex": null,
353
+ "flex_flow": null,
354
+ "grid_area": null,
355
+ "grid_auto_columns": null,
356
+ "grid_auto_flow": null,
357
+ "grid_auto_rows": null,
358
+ "grid_column": null,
359
+ "grid_gap": null,
360
+ "grid_row": null,
361
+ "grid_template_areas": null,
362
+ "grid_template_columns": null,
363
+ "grid_template_rows": null,
364
+ "height": null,
365
+ "justify_content": null,
366
+ "justify_items": null,
367
+ "left": null,
368
+ "margin": null,
369
+ "max_height": null,
370
+ "max_width": null,
371
+ "min_height": null,
372
+ "min_width": null,
373
+ "object_fit": null,
374
+ "object_position": null,
375
+ "order": null,
376
+ "overflow": null,
377
+ "overflow_x": null,
378
+ "overflow_y": null,
379
+ "padding": null,
380
+ "right": null,
381
+ "top": null,
382
+ "visibility": null,
383
+ "width": null
384
+ }
385
+ },
386
+ "68fc757825dd44a48ab2383db20958db": {
387
+ "model_module": "@jupyter-widgets/controls",
388
+ "model_name": "DescriptionStyleModel",
389
+ "model_module_version": "1.5.0",
390
+ "state": {
391
+ "_model_module": "@jupyter-widgets/controls",
392
+ "_model_module_version": "1.5.0",
393
+ "_model_name": "DescriptionStyleModel",
394
+ "_view_count": null,
395
+ "_view_module": "@jupyter-widgets/base",
396
+ "_view_module_version": "1.2.0",
397
+ "_view_name": "StyleView",
398
+ "description_width": ""
399
+ }
400
+ },
401
+ "cb76f933e6e640d9a688f7838e5fb0b3": {
402
+ "model_module": "@jupyter-widgets/base",
403
+ "model_name": "LayoutModel",
404
+ "model_module_version": "1.2.0",
405
+ "state": {
406
+ "_model_module": "@jupyter-widgets/base",
407
+ "_model_module_version": "1.2.0",
408
+ "_model_name": "LayoutModel",
409
+ "_view_count": null,
410
+ "_view_module": "@jupyter-widgets/base",
411
+ "_view_module_version": "1.2.0",
412
+ "_view_name": "LayoutView",
413
+ "align_content": null,
414
+ "align_items": null,
415
+ "align_self": null,
416
+ "border": null,
417
+ "bottom": null,
418
+ "display": null,
419
+ "flex": null,
420
+ "flex_flow": null,
421
+ "grid_area": null,
422
+ "grid_auto_columns": null,
423
+ "grid_auto_flow": null,
424
+ "grid_auto_rows": null,
425
+ "grid_column": null,
426
+ "grid_gap": null,
427
+ "grid_row": null,
428
+ "grid_template_areas": null,
429
+ "grid_template_columns": null,
430
+ "grid_template_rows": null,
431
+ "height": null,
432
+ "justify_content": null,
433
+ "justify_items": null,
434
+ "left": null,
435
+ "margin": null,
436
+ "max_height": null,
437
+ "max_width": null,
438
+ "min_height": null,
439
+ "min_width": null,
440
+ "object_fit": null,
441
+ "object_position": null,
442
+ "order": null,
443
+ "overflow": null,
444
+ "overflow_x": null,
445
+ "overflow_y": null,
446
+ "padding": null,
447
+ "right": null,
448
+ "top": null,
449
+ "visibility": null,
450
+ "width": null
451
+ }
452
+ },
453
+ "8704264bff4d46c9813ac9acf92da962": {
454
+ "model_module": "@jupyter-widgets/controls",
455
+ "model_name": "ButtonStyleModel",
456
+ "model_module_version": "1.5.0",
457
+ "state": {
458
+ "_model_module": "@jupyter-widgets/controls",
459
+ "_model_module_version": "1.5.0",
460
+ "_model_name": "ButtonStyleModel",
461
+ "_view_count": null,
462
+ "_view_module": "@jupyter-widgets/base",
463
+ "_view_module_version": "1.2.0",
464
+ "_view_name": "StyleView",
465
+ "button_color": null,
466
+ "font_weight": ""
467
+ }
468
+ },
469
+ "9b5d87960dde401baeaf8b6144fb8bad": {
470
+ "model_module": "@jupyter-widgets/base",
471
+ "model_name": "LayoutModel",
472
+ "model_module_version": "1.2.0",
473
+ "state": {
474
+ "_model_module": "@jupyter-widgets/base",
475
+ "_model_module_version": "1.2.0",
476
+ "_model_name": "LayoutModel",
477
+ "_view_count": null,
478
+ "_view_module": "@jupyter-widgets/base",
479
+ "_view_module_version": "1.2.0",
480
+ "_view_name": "LayoutView",
481
+ "align_content": null,
482
+ "align_items": null,
483
+ "align_self": null,
484
+ "border": null,
485
+ "bottom": null,
486
+ "display": null,
487
+ "flex": null,
488
+ "flex_flow": null,
489
+ "grid_area": null,
490
+ "grid_auto_columns": null,
491
+ "grid_auto_flow": null,
492
+ "grid_auto_rows": null,
493
+ "grid_column": null,
494
+ "grid_gap": null,
495
+ "grid_row": null,
496
+ "grid_template_areas": null,
497
+ "grid_template_columns": null,
498
+ "grid_template_rows": null,
499
+ "height": null,
500
+ "justify_content": null,
501
+ "justify_items": null,
502
+ "left": null,
503
+ "margin": null,
504
+ "max_height": null,
505
+ "max_width": null,
506
+ "min_height": null,
507
+ "min_width": null,
508
+ "object_fit": null,
509
+ "object_position": null,
510
+ "order": null,
511
+ "overflow": null,
512
+ "overflow_x": null,
513
+ "overflow_y": null,
514
+ "padding": null,
515
+ "right": null,
516
+ "top": null,
517
+ "visibility": null,
518
+ "width": null
519
+ }
520
+ },
521
+ "76e06881e5e94197a24944e07fdf3189": {
522
+ "model_module": "@jupyter-widgets/controls",
523
+ "model_name": "DescriptionStyleModel",
524
+ "model_module_version": "1.5.0",
525
+ "state": {
526
+ "_model_module": "@jupyter-widgets/controls",
527
+ "_model_module_version": "1.5.0",
528
+ "_model_name": "DescriptionStyleModel",
529
+ "_view_count": null,
530
+ "_view_module": "@jupyter-widgets/base",
531
+ "_view_module_version": "1.2.0",
532
+ "_view_name": "StyleView",
533
+ "description_width": ""
534
+ }
535
+ },
536
+ "f40dd696acc64c6284c6f8f485f3ce9d": {
537
+ "model_module": "@jupyter-widgets/controls",
538
+ "model_name": "LabelModel",
539
+ "model_module_version": "1.5.0",
540
+ "state": {
541
+ "_dom_classes": [],
542
+ "_model_module": "@jupyter-widgets/controls",
543
+ "_model_module_version": "1.5.0",
544
+ "_model_name": "LabelModel",
545
+ "_view_count": null,
546
+ "_view_module": "@jupyter-widgets/controls",
547
+ "_view_module_version": "1.5.0",
548
+ "_view_name": "LabelView",
549
+ "description": "",
550
+ "description_tooltip": null,
551
+ "layout": "IPY_MODEL_4488de26dce74cbbb39d99ae09bd21fa",
552
+ "placeholder": "​",
553
+ "style": "IPY_MODEL_ded62e6c032745ec88ca0ab694b0d397",
554
+ "value": "Connecting..."
555
+ }
556
+ },
557
+ "4488de26dce74cbbb39d99ae09bd21fa": {
558
+ "model_module": "@jupyter-widgets/base",
559
+ "model_name": "LayoutModel",
560
+ "model_module_version": "1.2.0",
561
+ "state": {
562
+ "_model_module": "@jupyter-widgets/base",
563
+ "_model_module_version": "1.2.0",
564
+ "_model_name": "LayoutModel",
565
+ "_view_count": null,
566
+ "_view_module": "@jupyter-widgets/base",
567
+ "_view_module_version": "1.2.0",
568
+ "_view_name": "LayoutView",
569
+ "align_content": null,
570
+ "align_items": null,
571
+ "align_self": null,
572
+ "border": null,
573
+ "bottom": null,
574
+ "display": null,
575
+ "flex": null,
576
+ "flex_flow": null,
577
+ "grid_area": null,
578
+ "grid_auto_columns": null,
579
+ "grid_auto_flow": null,
580
+ "grid_auto_rows": null,
581
+ "grid_column": null,
582
+ "grid_gap": null,
583
+ "grid_row": null,
584
+ "grid_template_areas": null,
585
+ "grid_template_columns": null,
586
+ "grid_template_rows": null,
587
+ "height": null,
588
+ "justify_content": null,
589
+ "justify_items": null,
590
+ "left": null,
591
+ "margin": null,
592
+ "max_height": null,
593
+ "max_width": null,
594
+ "min_height": null,
595
+ "min_width": null,
596
+ "object_fit": null,
597
+ "object_position": null,
598
+ "order": null,
599
+ "overflow": null,
600
+ "overflow_x": null,
601
+ "overflow_y": null,
602
+ "padding": null,
603
+ "right": null,
604
+ "top": null,
605
+ "visibility": null,
606
+ "width": null
607
+ }
608
+ },
609
+ "ded62e6c032745ec88ca0ab694b0d397": {
610
+ "model_module": "@jupyter-widgets/controls",
611
+ "model_name": "DescriptionStyleModel",
612
+ "model_module_version": "1.5.0",
613
+ "state": {
614
+ "_model_module": "@jupyter-widgets/controls",
615
+ "_model_module_version": "1.5.0",
616
+ "_model_name": "DescriptionStyleModel",
617
+ "_view_count": null,
618
+ "_view_module": "@jupyter-widgets/base",
619
+ "_view_module_version": "1.2.0",
620
+ "_view_name": "StyleView",
621
+ "description_width": ""
622
+ }
623
+ },
624
+ "be523e956910487ca263d943a7a58395": {
625
+ "model_module": "@jupyter-widgets/controls",
626
+ "model_name": "HBoxModel",
627
+ "model_module_version": "1.5.0",
628
+ "state": {
629
+ "_dom_classes": [],
630
+ "_model_module": "@jupyter-widgets/controls",
631
+ "_model_module_version": "1.5.0",
632
+ "_model_name": "HBoxModel",
633
+ "_view_count": null,
634
+ "_view_module": "@jupyter-widgets/controls",
635
+ "_view_module_version": "1.5.0",
636
+ "_view_name": "HBoxView",
637
+ "box_style": "",
638
+ "children": [
639
+ "IPY_MODEL_01dc23faab3d42cda41fdfdd2a7dfed5",
640
+ "IPY_MODEL_777d7addfb144fd8896b77a1e0d54f25",
641
+ "IPY_MODEL_c518268069244b21810e84380502c190"
642
+ ],
643
+ "layout": "IPY_MODEL_fee72c1c455549b59092028b855a082a"
644
+ }
645
+ },
646
+ "01dc23faab3d42cda41fdfdd2a7dfed5": {
647
+ "model_module": "@jupyter-widgets/controls",
648
+ "model_name": "HTMLModel",
649
+ "model_module_version": "1.5.0",
650
+ "state": {
651
+ "_dom_classes": [],
652
+ "_model_module": "@jupyter-widgets/controls",
653
+ "_model_module_version": "1.5.0",
654
+ "_model_name": "HTMLModel",
655
+ "_view_count": null,
656
+ "_view_module": "@jupyter-widgets/controls",
657
+ "_view_module_version": "1.5.0",
658
+ "_view_name": "HTMLView",
659
+ "description": "",
660
+ "description_tooltip": null,
661
+ "layout": "IPY_MODEL_ed0fa93199b94fb486c125d4f322d59f",
662
+ "placeholder": "​",
663
+ "style": "IPY_MODEL_66f82e7ef3694c699e3d4a2bd826392b",
664
+ "value": "Loading checkpoint shards: 100%"
665
+ }
666
+ },
667
+ "777d7addfb144fd8896b77a1e0d54f25": {
668
+ "model_module": "@jupyter-widgets/controls",
669
+ "model_name": "FloatProgressModel",
670
+ "model_module_version": "1.5.0",
671
+ "state": {
672
+ "_dom_classes": [],
673
+ "_model_module": "@jupyter-widgets/controls",
674
+ "_model_module_version": "1.5.0",
675
+ "_model_name": "FloatProgressModel",
676
+ "_view_count": null,
677
+ "_view_module": "@jupyter-widgets/controls",
678
+ "_view_module_version": "1.5.0",
679
+ "_view_name": "ProgressView",
680
+ "bar_style": "success",
681
+ "description": "",
682
+ "description_tooltip": null,
683
+ "layout": "IPY_MODEL_2bfd51e3ae954008ae83704c24dbd6cb",
684
+ "max": 4,
685
+ "min": 0,
686
+ "orientation": "horizontal",
687
+ "style": "IPY_MODEL_f8b84d8c06384680973ef6fe787b5a5d",
688
+ "value": 4
689
+ }
690
+ },
691
+ "c518268069244b21810e84380502c190": {
692
+ "model_module": "@jupyter-widgets/controls",
693
+ "model_name": "HTMLModel",
694
+ "model_module_version": "1.5.0",
695
+ "state": {
696
+ "_dom_classes": [],
697
+ "_model_module": "@jupyter-widgets/controls",
698
+ "_model_module_version": "1.5.0",
699
+ "_model_name": "HTMLModel",
700
+ "_view_count": null,
701
+ "_view_module": "@jupyter-widgets/controls",
702
+ "_view_module_version": "1.5.0",
703
+ "_view_name": "HTMLView",
704
+ "description": "",
705
+ "description_tooltip": null,
706
+ "layout": "IPY_MODEL_770341dc116148a8b7571cce3a2f2baf",
707
+ "placeholder": "​",
708
+ "style": "IPY_MODEL_29416122cc0b4a5592668ddced7686ba",
709
+ "value": " 4/4 [00:00&lt;00:00,  5.03it/s]"
710
+ }
711
+ },
712
+ "fee72c1c455549b59092028b855a082a": {
713
+ "model_module": "@jupyter-widgets/base",
714
+ "model_name": "LayoutModel",
715
+ "model_module_version": "1.2.0",
716
+ "state": {
717
+ "_model_module": "@jupyter-widgets/base",
718
+ "_model_module_version": "1.2.0",
719
+ "_model_name": "LayoutModel",
720
+ "_view_count": null,
721
+ "_view_module": "@jupyter-widgets/base",
722
+ "_view_module_version": "1.2.0",
723
+ "_view_name": "LayoutView",
724
+ "align_content": null,
725
+ "align_items": null,
726
+ "align_self": null,
727
+ "border": null,
728
+ "bottom": null,
729
+ "display": null,
730
+ "flex": null,
731
+ "flex_flow": null,
732
+ "grid_area": null,
733
+ "grid_auto_columns": null,
734
+ "grid_auto_flow": null,
735
+ "grid_auto_rows": null,
736
+ "grid_column": null,
737
+ "grid_gap": null,
738
+ "grid_row": null,
739
+ "grid_template_areas": null,
740
+ "grid_template_columns": null,
741
+ "grid_template_rows": null,
742
+ "height": null,
743
+ "justify_content": null,
744
+ "justify_items": null,
745
+ "left": null,
746
+ "margin": null,
747
+ "max_height": null,
748
+ "max_width": null,
749
+ "min_height": null,
750
+ "min_width": null,
751
+ "object_fit": null,
752
+ "object_position": null,
753
+ "order": null,
754
+ "overflow": null,
755
+ "overflow_x": null,
756
+ "overflow_y": null,
757
+ "padding": null,
758
+ "right": null,
759
+ "top": null,
760
+ "visibility": null,
761
+ "width": null
762
+ }
763
+ },
764
+ "ed0fa93199b94fb486c125d4f322d59f": {
765
+ "model_module": "@jupyter-widgets/base",
766
+ "model_name": "LayoutModel",
767
+ "model_module_version": "1.2.0",
768
+ "state": {
769
+ "_model_module": "@jupyter-widgets/base",
770
+ "_model_module_version": "1.2.0",
771
+ "_model_name": "LayoutModel",
772
+ "_view_count": null,
773
+ "_view_module": "@jupyter-widgets/base",
774
+ "_view_module_version": "1.2.0",
775
+ "_view_name": "LayoutView",
776
+ "align_content": null,
777
+ "align_items": null,
778
+ "align_self": null,
779
+ "border": null,
780
+ "bottom": null,
781
+ "display": null,
782
+ "flex": null,
783
+ "flex_flow": null,
784
+ "grid_area": null,
785
+ "grid_auto_columns": null,
786
+ "grid_auto_flow": null,
787
+ "grid_auto_rows": null,
788
+ "grid_column": null,
789
+ "grid_gap": null,
790
+ "grid_row": null,
791
+ "grid_template_areas": null,
792
+ "grid_template_columns": null,
793
+ "grid_template_rows": null,
794
+ "height": null,
795
+ "justify_content": null,
796
+ "justify_items": null,
797
+ "left": null,
798
+ "margin": null,
799
+ "max_height": null,
800
+ "max_width": null,
801
+ "min_height": null,
802
+ "min_width": null,
803
+ "object_fit": null,
804
+ "object_position": null,
805
+ "order": null,
806
+ "overflow": null,
807
+ "overflow_x": null,
808
+ "overflow_y": null,
809
+ "padding": null,
810
+ "right": null,
811
+ "top": null,
812
+ "visibility": null,
813
+ "width": null
814
+ }
815
+ },
816
+ "66f82e7ef3694c699e3d4a2bd826392b": {
817
+ "model_module": "@jupyter-widgets/controls",
818
+ "model_name": "DescriptionStyleModel",
819
+ "model_module_version": "1.5.0",
820
+ "state": {
821
+ "_model_module": "@jupyter-widgets/controls",
822
+ "_model_module_version": "1.5.0",
823
+ "_model_name": "DescriptionStyleModel",
824
+ "_view_count": null,
825
+ "_view_module": "@jupyter-widgets/base",
826
+ "_view_module_version": "1.2.0",
827
+ "_view_name": "StyleView",
828
+ "description_width": ""
829
+ }
830
+ },
831
+ "2bfd51e3ae954008ae83704c24dbd6cb": {
832
+ "model_module": "@jupyter-widgets/base",
833
+ "model_name": "LayoutModel",
834
+ "model_module_version": "1.2.0",
835
+ "state": {
836
+ "_model_module": "@jupyter-widgets/base",
837
+ "_model_module_version": "1.2.0",
838
+ "_model_name": "LayoutModel",
839
+ "_view_count": null,
840
+ "_view_module": "@jupyter-widgets/base",
841
+ "_view_module_version": "1.2.0",
842
+ "_view_name": "LayoutView",
843
+ "align_content": null,
844
+ "align_items": null,
845
+ "align_self": null,
846
+ "border": null,
847
+ "bottom": null,
848
+ "display": null,
849
+ "flex": null,
850
+ "flex_flow": null,
851
+ "grid_area": null,
852
+ "grid_auto_columns": null,
853
+ "grid_auto_flow": null,
854
+ "grid_auto_rows": null,
855
+ "grid_column": null,
856
+ "grid_gap": null,
857
+ "grid_row": null,
858
+ "grid_template_areas": null,
859
+ "grid_template_columns": null,
860
+ "grid_template_rows": null,
861
+ "height": null,
862
+ "justify_content": null,
863
+ "justify_items": null,
864
+ "left": null,
865
+ "margin": null,
866
+ "max_height": null,
867
+ "max_width": null,
868
+ "min_height": null,
869
+ "min_width": null,
870
+ "object_fit": null,
871
+ "object_position": null,
872
+ "order": null,
873
+ "overflow": null,
874
+ "overflow_x": null,
875
+ "overflow_y": null,
876
+ "padding": null,
877
+ "right": null,
878
+ "top": null,
879
+ "visibility": null,
880
+ "width": null
881
+ }
882
+ },
883
+ "f8b84d8c06384680973ef6fe787b5a5d": {
884
+ "model_module": "@jupyter-widgets/controls",
885
+ "model_name": "ProgressStyleModel",
886
+ "model_module_version": "1.5.0",
887
+ "state": {
888
+ "_model_module": "@jupyter-widgets/controls",
889
+ "_model_module_version": "1.5.0",
890
+ "_model_name": "ProgressStyleModel",
891
+ "_view_count": null,
892
+ "_view_module": "@jupyter-widgets/base",
893
+ "_view_module_version": "1.2.0",
894
+ "_view_name": "StyleView",
895
+ "bar_color": null,
896
+ "description_width": ""
897
+ }
898
+ },
899
+ "770341dc116148a8b7571cce3a2f2baf": {
900
+ "model_module": "@jupyter-widgets/base",
901
+ "model_name": "LayoutModel",
902
+ "model_module_version": "1.2.0",
903
+ "state": {
904
+ "_model_module": "@jupyter-widgets/base",
905
+ "_model_module_version": "1.2.0",
906
+ "_model_name": "LayoutModel",
907
+ "_view_count": null,
908
+ "_view_module": "@jupyter-widgets/base",
909
+ "_view_module_version": "1.2.0",
910
+ "_view_name": "LayoutView",
911
+ "align_content": null,
912
+ "align_items": null,
913
+ "align_self": null,
914
+ "border": null,
915
+ "bottom": null,
916
+ "display": null,
917
+ "flex": null,
918
+ "flex_flow": null,
919
+ "grid_area": null,
920
+ "grid_auto_columns": null,
921
+ "grid_auto_flow": null,
922
+ "grid_auto_rows": null,
923
+ "grid_column": null,
924
+ "grid_gap": null,
925
+ "grid_row": null,
926
+ "grid_template_areas": null,
927
+ "grid_template_columns": null,
928
+ "grid_template_rows": null,
929
+ "height": null,
930
+ "justify_content": null,
931
+ "justify_items": null,
932
+ "left": null,
933
+ "margin": null,
934
+ "max_height": null,
935
+ "max_width": null,
936
+ "min_height": null,
937
+ "min_width": null,
938
+ "object_fit": null,
939
+ "object_position": null,
940
+ "order": null,
941
+ "overflow": null,
942
+ "overflow_x": null,
943
+ "overflow_y": null,
944
+ "padding": null,
945
+ "right": null,
946
+ "top": null,
947
+ "visibility": null,
948
+ "width": null
949
+ }
950
+ },
951
+ "29416122cc0b4a5592668ddced7686ba": {
952
+ "model_module": "@jupyter-widgets/controls",
953
+ "model_name": "DescriptionStyleModel",
954
+ "model_module_version": "1.5.0",
955
+ "state": {
956
+ "_model_module": "@jupyter-widgets/controls",
957
+ "_model_module_version": "1.5.0",
958
+ "_model_name": "DescriptionStyleModel",
959
+ "_view_count": null,
960
+ "_view_module": "@jupyter-widgets/base",
961
+ "_view_module_version": "1.2.0",
962
+ "_view_name": "StyleView",
963
+ "description_width": ""
964
+ }
965
+ }
966
+ }
967
+ }
968
+ },
969
+ "cells": [
970
+ {
971
+ "cell_type": "markdown",
972
+ "metadata": {
973
+ "id": "view-in-github",
974
+ "colab_type": "text"
975
+ },
976
+ "source": [
977
+ "<a href=\"https://colab.research.google.com/github/merveenoyan/smol-vision/blob/main/Gemma_3n_Video_Vibe_Tests.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
978
+ ]
979
+ },
980
+ {
981
+ "cell_type": "markdown",
982
+ "source": [
983
+ "## Gemma 3n Video with Audio Inference"
984
+ ],
985
+ "metadata": {
986
+ "id": "onFz3_7AqnaB"
987
+ }
988
+ },
989
+ {
990
+ "cell_type": "markdown",
991
+ "source": [
992
+ "In this notebook we'll infer Gemma-3n videos with audios inside."
993
+ ],
994
+ "metadata": {
995
+ "id": "KKUnhy4JqqAg"
996
+ }
997
+ },
998
+ {
999
+ "cell_type": "code",
1000
+ "source": [
1001
+ "!pip install -U -q transformers timm datasets"
1002
+ ],
1003
+ "metadata": {
1004
+ "id": "Vf-VvnrNjuxF"
1005
+ },
1006
+ "execution_count": null,
1007
+ "outputs": []
1008
+ },
1009
+ {
1010
+ "cell_type": "markdown",
1011
+ "source": [
1012
+ "We will load three examples from FineVideo dataset and Gemma-3n model so make sure you have access to both and provide access token."
1013
+ ],
1014
+ "metadata": {
1015
+ "id": "gcJbxIPLqvjH"
1016
+ }
1017
+ },
1018
+ {
1019
+ "cell_type": "code",
1020
+ "source": [
1021
+ "from huggingface_hub import login\n",
1022
+ "login()"
1023
+ ],
1024
+ "metadata": {
1025
+ "id": "bROdG2-Jj9lT",
1026
+ "colab": {
1027
+ "base_uri": "https://localhost:8080/",
1028
+ "height": 17,
1029
+ "referenced_widgets": [
1030
+ "542490f74e974451bc44009a6fa174bd",
1031
+ "409f985be1134b468b81136fbdb54408",
1032
+ "57cb1e931c614980a4147cb125524d7d",
1033
+ "87dc7aaf52e349a7bb43bb1b8bc137ee",
1034
+ "983ed4cb4eea42daa9ae8c0417021a21",
1035
+ "40c381fd7bb04b43a879044a4e988cc6",
1036
+ "8d0e5abdd7c549f1a66ee198c9fa1430",
1037
+ "c72dd3d6a4c246cfa6590c314783c8f0",
1038
+ "c0e471e664dd41eab98efe08301ef5e1",
1039
+ "868f63ea9455442d837dc2c422918800",
1040
+ "5b7b4707b1bf4159a10bf7e289bde435",
1041
+ "889d0d1ed24e4de2b89896511d008e60",
1042
+ "68fc757825dd44a48ab2383db20958db",
1043
+ "cb76f933e6e640d9a688f7838e5fb0b3",
1044
+ "8704264bff4d46c9813ac9acf92da962",
1045
+ "9b5d87960dde401baeaf8b6144fb8bad",
1046
+ "76e06881e5e94197a24944e07fdf3189",
1047
+ "f40dd696acc64c6284c6f8f485f3ce9d",
1048
+ "4488de26dce74cbbb39d99ae09bd21fa",
1049
+ "ded62e6c032745ec88ca0ab694b0d397"
1050
+ ]
1051
+ },
1052
+ "outputId": "1978e9bd-3b52-40b8-e643-418f9872476d"
1053
+ },
1054
+ "execution_count": null,
1055
+ "outputs": [
1056
+ {
1057
+ "output_type": "display_data",
1058
+ "data": {
1059
+ "text/plain": [
1060
+ "VBox(children=(HTML(value='<center> <img\\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…"
1061
+ ],
1062
+ "application/vnd.jupyter.widget-view+json": {
1063
+ "version_major": 2,
1064
+ "version_minor": 0,
1065
+ "model_id": "542490f74e974451bc44009a6fa174bd"
1066
+ }
1067
+ },
1068
+ "metadata": {}
1069
+ }
1070
+ ]
1071
+ },
1072
+ {
1073
+ "cell_type": "code",
1074
+ "execution_count": null,
1075
+ "metadata": {
1076
+ "id": "TMiKyRtAjjAc",
1077
+ "colab": {
1078
+ "base_uri": "https://localhost:8080/",
1079
+ "height": 173,
1080
+ "referenced_widgets": [
1081
+ "be523e956910487ca263d943a7a58395",
1082
+ "01dc23faab3d42cda41fdfdd2a7dfed5",
1083
+ "777d7addfb144fd8896b77a1e0d54f25",
1084
+ "c518268069244b21810e84380502c190",
1085
+ "fee72c1c455549b59092028b855a082a",
1086
+ "ed0fa93199b94fb486c125d4f322d59f",
1087
+ "66f82e7ef3694c699e3d4a2bd826392b",
1088
+ "2bfd51e3ae954008ae83704c24dbd6cb",
1089
+ "f8b84d8c06384680973ef6fe787b5a5d",
1090
+ "770341dc116148a8b7571cce3a2f2baf",
1091
+ "29416122cc0b4a5592668ddced7686ba"
1092
+ ]
1093
+ },
1094
+ "outputId": "7351e21a-3c82-4d0c-c827-24b66812f181"
1095
+ },
1096
+ "outputs": [
1097
+ {
1098
+ "output_type": "stream",
1099
+ "name": "stderr",
1100
+ "text": [
1101
+ "/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning: \n",
1102
+ "The secret `HF_TOKEN` does not exist in your Colab secrets.\n",
1103
+ "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n",
1104
+ "You will be able to reuse this secret in all of your notebooks.\n",
1105
+ "Please note that authentication is recommended but still optional to access public models or datasets.\n",
1106
+ " warnings.warn(\n"
1107
+ ]
1108
+ },
1109
+ {
1110
+ "output_type": "display_data",
1111
+ "data": {
1112
+ "text/plain": [
1113
+ "Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]"
1114
+ ],
1115
+ "application/vnd.jupyter.widget-view+json": {
1116
+ "version_major": 2,
1117
+ "version_minor": 0,
1118
+ "model_id": "be523e956910487ca263d943a7a58395"
1119
+ }
1120
+ },
1121
+ "metadata": {}
1122
+ }
1123
+ ],
1124
+ "source": [
1125
+ "from transformers import AutoProcessor, Gemma3nForConditionalGeneration\n",
1126
+ "import torch\n",
1127
+ "model = Gemma3nForConditionalGeneration.from_pretrained(\n",
1128
+ " \"google/gemma-3n-E4B-it\", torch_dtype=torch.bfloat16,\n",
1129
+ ").to(\"cuda\")\n",
1130
+ "processor = AutoProcessor.from_pretrained(\n",
1131
+ " \"google/gemma-3n-E4B-it\",\n",
1132
+ ")\n",
1133
+ "processor.tokenizer.padding_side = \"right\""
1134
+ ]
1135
+ },
1136
+ {
1137
+ "cell_type": "markdown",
1138
+ "source": [
1139
+ "Download video for inference."
1140
+ ],
1141
+ "metadata": {
1142
+ "id": "mQzrURJlNRwW"
1143
+ }
1144
+ },
1145
+ {
1146
+ "cell_type": "code",
1147
+ "source": [
1148
+ "!wget https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/IMG_8137.mp4"
1149
+ ],
1150
+ "metadata": {
1151
+ "colab": {
1152
+ "base_uri": "https://localhost:8080/"
1153
+ },
1154
+ "id": "PAQ1S2uDMIzj",
1155
+ "outputId": "c584ee8c-b960-4f82-f2c6-be194709256f"
1156
+ },
1157
+ "execution_count": null,
1158
+ "outputs": [
1159
+ {
1160
+ "output_type": "stream",
1161
+ "name": "stdout",
1162
+ "text": [
1163
+ "--2025-07-01 13:39:22-- https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/IMG_8137.mp4\n",
1164
+ "Resolving huggingface.co (huggingface.co)... 18.172.134.4, 18.172.134.24, 18.172.134.124, ...\n",
1165
+ "Connecting to huggingface.co (huggingface.co)|18.172.134.4|:443... connected.\n",
1166
+ "HTTP request sent, awaiting response... 302 Found\n",
1167
+ "Location: https://cdn-lfs-us-1.hf.co/repos/7b/14/7b14679bb56cefbf7829be71f3f444110ccc308f431bd8596f534e743367ea5c/6331cbb913feb48349e3b7015a7969e04ce3cd594b1bda7278e4e33fe4a3f5f3?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27IMG_8137.mp4%3B+filename%3D%22IMG_8137.mp4%22%3B&response-content-type=video%2Fmp4&Expires=1751380762&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc1MTM4MDc2Mn19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmhmLmNvL3JlcG9zLzdiLzE0LzdiMTQ2NzliYjU2Y2VmYmY3ODI5YmU3MWYzZjQ0NDExMGNjYzMwOGY0MzFiZDg1OTZmNTM0ZTc0MzM2N2VhNWMvNjMzMWNiYjkxM2ZlYjQ4MzQ5ZTNiNzAxNWE3OTY5ZTA0Y2UzY2Q1OTRiMWJkYTcyNzhlNGUzM2ZlNGEzZjVmMz9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSomcmVzcG9uc2UtY29udGVudC10eXBlPSoifV19&Signature=MsPaMyO17sK%7Eo3U41ncCYEHd2vpjR6Jvv2IiqrhIy45kp-2WPdIGaYg5F7g9ENDJfFqmYavs6VH26AdLbX3HLPBUoR%7EAV8Iew8V1lFK1SpMkyCkh0SMtYNHqSw27jJ1ZSIhMKnHA7hRGi5b8LAhBiGzmlikz4a%7EtZAjjQZ18ZyN8GxCvTironzCp3uKUExWpRQF%7EwEwqurBb%7EKs-uJ6KDLvshYInzF%7Eo1LEoRNlXdxmDk8Q5Q7ZnBFM5m%7EPvBt-OQ4WWDPQZ86qblHwtoAgf483cdviYLPd8PjGzarQxgrjxbqELMvXM-nvUdXcOuAwhbBzpzSwBGQManPZxOFKTFw__&Key-Pair-Id=K24J24Z295AEI9 [following]\n",
1168
+ "--2025-07-01 13:39:22-- https://cdn-lfs-us-1.hf.co/repos/7b/14/7b14679bb56cefbf7829be71f3f444110ccc308f431bd8596f534e743367ea5c/6331cbb913feb48349e3b7015a7969e04ce3cd594b1bda7278e4e33fe4a3f5f3?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27IMG_8137.mp4%3B+filename%3D%22IMG_8137.mp4%22%3B&response-content-type=video%2Fmp4&Expires=1751380762&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc1MTM4MDc2Mn19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmhmLmNvL3JlcG9zLzdiLzE0LzdiMTQ2NzliYjU2Y2VmYmY3ODI5YmU3MWYzZjQ0NDExMGNjYzMwOGY0MzFiZDg1OTZmNTM0ZTc0MzM2N2VhNWMvNjMzMWNiYjkxM2ZlYjQ4MzQ5ZTNiNzAxNWE3OTY5ZTA0Y2UzY2Q1OTRiMWJkYTcyNzhlNGUzM2ZlNGEzZjVmMz9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSomcmVzcG9uc2UtY29udGVudC10eXBlPSoifV19&Signature=MsPaMyO17sK%7Eo3U41ncCYEHd2vpjR6Jvv2IiqrhIy45kp-2WPdIGaYg5F7g9ENDJfFqmYavs6VH26AdLbX3HLPBUoR%7EAV8Iew8V1lFK1SpMkyCkh0SMtYNHqSw27jJ1ZSIhMKnHA7hRGi5b8LAhBiGzmlikz4a%7EtZAjjQZ18ZyN8GxCvTironzCp3uKUExWpRQF%7EwEwqurBb%7EKs-uJ6KDLvshYInzF%7Eo1LEoRNlXdxmDk8Q5Q7ZnBFM5m%7EPvBt-OQ4WWDPQZ86qblHwtoAgf483cdviYLPd8PjGzarQxgrjxbqELMvXM-nvUdXcOuAwhbBzpzSwBGQManPZxOFKTFw__&Key-Pair-Id=K24J24Z295AEI9\n",
1169
+ "Resolving cdn-lfs-us-1.hf.co (cdn-lfs-us-1.hf.co)... 3.167.138.114, 3.167.138.90, 3.167.138.39, ...\n",
1170
+ "Connecting to cdn-lfs-us-1.hf.co (cdn-lfs-us-1.hf.co)|3.167.138.114|:443... connected.\n",
1171
+ "HTTP request sent, awaiting response... 200 OK\n",
1172
+ "Length: 5340706 (5.1M) [video/mp4]\n",
1173
+ "Saving to: ‘IMG_8137.mp4’\n",
1174
+ "\n",
1175
+ "IMG_8137.mp4 100%[===================>] 5.09M 27.1MB/s in 0.2s \n",
1176
+ "\n",
1177
+ "2025-07-01 13:39:22 (27.1 MB/s) - ‘IMG_8137.mp4’ saved [5340706/5340706]\n",
1178
+ "\n"
1179
+ ]
1180
+ }
1181
+ ]
1182
+ },
1183
+ {
1184
+ "cell_type": "markdown",
1185
+ "source": [
1186
+ "Strip audios from video."
1187
+ ],
1188
+ "metadata": {
1189
+ "id": "KXlBj7dVtUFZ"
1190
+ }
1191
+ },
1192
+ {
1193
+ "cell_type": "code",
1194
+ "source": [
1195
+ "import os\n",
1196
+ "import subprocess\n",
1197
+ "filename = \"IMG_8137.mp4\"\n",
1198
+ "audio_path = os.path.join(\"audios\", f\"audio.wav\")\n",
1199
+ "\n",
1200
+ "subprocess.run([\n",
1201
+ " \"ffmpeg\", \"-i\", filename,\n",
1202
+ " \"-q:a\", \"0\", \"-map\", \"a\",\n",
1203
+ " audio_path,\n",
1204
+ " \"-y\"\n",
1205
+ "], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)"
1206
+ ],
1207
+ "metadata": {
1208
+ "colab": {
1209
+ "base_uri": "https://localhost:8080/"
1210
+ },
1211
+ "id": "FQhKimtlMOHe",
1212
+ "outputId": "ef05231a-ce56-4733-b0be-d6b423a143ae"
1213
+ },
1214
+ "execution_count": null,
1215
+ "outputs": [
1216
+ {
1217
+ "output_type": "execute_result",
1218
+ "data": {
1219
+ "text/plain": [
1220
+ "CompletedProcess(args=['ffmpeg', '-i', 'IMG_8137.mp4', '-q:a', '0', '-map', 'a', 'audios/audio.wav', '-y'], returncode=0)"
1221
+ ]
1222
+ },
1223
+ "metadata": {},
1224
+ "execution_count": 57
1225
+ }
1226
+ ]
1227
+ },
1228
+ {
1229
+ "cell_type": "code",
1230
+ "source": [
1231
+ "import cv2\n",
1232
+ "from PIL import Image\n",
1233
+ "import numpy as np\n",
1234
+ "\n",
1235
+ "def downsample_video(video_path):\n",
1236
+ " vidcap = cv2.VideoCapture(video_path)\n",
1237
+ " total_frames = int(vidcap.get(cv2.CAP_PROP_FRAME_COUNT))\n",
1238
+ " fps = vidcap.get(cv2.CAP_PROP_FPS)\n",
1239
+ "\n",
1240
+ " frames = []\n",
1241
+ " frame_indices = np.linspace(0, total_frames - 1, 7, dtype=int)\n",
1242
+ "\n",
1243
+ " for i in frame_indices:\n",
1244
+ " vidcap.set(cv2.CAP_PROP_POS_FRAMES, i)\n",
1245
+ " success, image = vidcap.read()\n",
1246
+ " if success:\n",
1247
+ " image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) # Convert from BGR to RGB\n",
1248
+ " pil_image = Image.fromarray(image)\n",
1249
+ " timestamp = round(i / fps, 2)\n",
1250
+ " frames.append((pil_image, timestamp))\n",
1251
+ "\n",
1252
+ " vidcap.release()\n",
1253
+ " return frames\n"
1254
+ ],
1255
+ "metadata": {
1256
+ "id": "6e_cExwMjx7v"
1257
+ },
1258
+ "execution_count": null,
1259
+ "outputs": []
1260
+ },
1261
+ {
1262
+ "cell_type": "markdown",
1263
+ "source": [
1264
+ "We will generate descriptions to videos and compare them to irl description in the metadata for the vibecheck.\n",
1265
+ "\n",
1266
+ "We need to downsample video to frames."
1267
+ ],
1268
+ "metadata": {
1269
+ "id": "mRKCPRabuMs6"
1270
+ }
1271
+ },
1272
+ {
1273
+ "cell_type": "code",
1274
+ "source": [
1275
+ "frames = downsample_video(filename)"
1276
+ ],
1277
+ "metadata": {
1278
+ "id": "UMJESbFulYTi"
1279
+ },
1280
+ "execution_count": null,
1281
+ "outputs": []
1282
+ },
1283
+ {
1284
+ "cell_type": "code",
1285
+ "source": [
1286
+ "frames"
1287
+ ],
1288
+ "metadata": {
1289
+ "colab": {
1290
+ "base_uri": "https://localhost:8080/"
1291
+ },
1292
+ "id": "wJKdYXasMfEG",
1293
+ "outputId": "2cff578c-df4d-41ca-8d9e-f85b4fed3456"
1294
+ },
1295
+ "execution_count": null,
1296
+ "outputs": [
1297
+ {
1298
+ "output_type": "execute_result",
1299
+ "data": {
1300
+ "text/plain": [
1301
+ "[(<PIL.Image.Image image mode=RGB size=1080x1920>, np.float64(0.0)),\n",
1302
+ " (<PIL.Image.Image image mode=RGB size=1080x1920>, np.float64(1.03)),\n",
1303
+ " (<PIL.Image.Image image mode=RGB size=1080x1920>, np.float64(2.09)),\n",
1304
+ " (<PIL.Image.Image image mode=RGB size=1080x1920>, np.float64(3.12)),\n",
1305
+ " (<PIL.Image.Image image mode=RGB size=1080x1920>, np.float64(4.17)),\n",
1306
+ " (<PIL.Image.Image image mode=RGB size=1080x1920>, np.float64(5.21)),\n",
1307
+ " (<PIL.Image.Image image mode=RGB size=1080x1920>, np.float64(6.26))]"
1308
+ ]
1309
+ },
1310
+ "metadata": {},
1311
+ "execution_count": 52
1312
+ }
1313
+ ]
1314
+ },
1315
+ {
1316
+ "cell_type": "code",
1317
+ "source": [
1318
+ "messages = [\n",
1319
+ " {\n",
1320
+ " \"role\": \"system\",\n",
1321
+ " \"content\": [{\"type\": \"text\", \"text\": \"You are a helpful assistant.\"}]\n",
1322
+ " },\n",
1323
+ " {\n",
1324
+ " \"role\": \"user\",\n",
1325
+ " \"content\": [\n",
1326
+ " {\"type\": \"text\", \"text\": f\"What is happening in this video? Summarize the events.\"}]\n",
1327
+ " }\n",
1328
+ "]\n",
1329
+ "for frame in frames:\n",
1330
+ " image, timestamp = frame\n",
1331
+ " messages[1][\"content\"].append({\"type\": \"text\", \"text\": f\"Frame {timestamp}: \"})\n",
1332
+ " image.save(f\"image_{timestamp}.png\")\n",
1333
+ " messages[1][\"content\"].append({\"type\": \"image\", \"url\": f\"./image_{timestamp}.png\"})\n",
1334
+ "messages[1][\"content\"].append({\"type\": \"audio\", \"audio\": f\"audios/audio.wav\"})"
1335
+ ],
1336
+ "metadata": {
1337
+ "id": "u8itVHCflZYQ"
1338
+ },
1339
+ "execution_count": null,
1340
+ "outputs": []
1341
+ },
1342
+ {
1343
+ "cell_type": "code",
1344
+ "source": [
1345
+ "messages"
1346
+ ],
1347
+ "metadata": {
1348
+ "id": "dBX4mNxXxGoC",
1349
+ "colab": {
1350
+ "base_uri": "https://localhost:8080/"
1351
+ },
1352
+ "outputId": "b738e828-bf9b-4f13-bbb2-9f38bea50b6a"
1353
+ },
1354
+ "execution_count": null,
1355
+ "outputs": [
1356
+ {
1357
+ "output_type": "execute_result",
1358
+ "data": {
1359
+ "text/plain": [
1360
+ "[{'role': 'system',\n",
1361
+ " 'content': [{'type': 'text', 'text': 'You are a helpful assistant.'}]},\n",
1362
+ " {'role': 'user',\n",
1363
+ " 'content': [{'type': 'text',\n",
1364
+ " 'text': 'What is happening in this video? Summarize the events.'},\n",
1365
+ " {'type': 'text', 'text': 'Frame 0.0: '},\n",
1366
+ " {'type': 'image', 'url': './image_0.0.png'},\n",
1367
+ " {'type': 'text', 'text': 'Frame 1.03: '},\n",
1368
+ " {'type': 'image', 'url': './image_1.03.png'},\n",
1369
+ " {'type': 'text', 'text': 'Frame 2.09: '},\n",
1370
+ " {'type': 'image', 'url': './image_2.09.png'},\n",
1371
+ " {'type': 'text', 'text': 'Frame 3.12: '},\n",
1372
+ " {'type': 'image', 'url': './image_3.12.png'},\n",
1373
+ " {'type': 'text', 'text': 'Frame 4.17: '},\n",
1374
+ " {'type': 'image', 'url': './image_4.17.png'},\n",
1375
+ " {'type': 'text', 'text': 'Frame 5.21: '},\n",
1376
+ " {'type': 'image', 'url': './image_5.21.png'},\n",
1377
+ " {'type': 'text', 'text': 'Frame 6.26: '},\n",
1378
+ " {'type': 'image', 'url': './image_6.26.png'},\n",
1379
+ " {'type': 'audio', 'audio': 'audios/audio.wav'}]}]"
1380
+ ]
1381
+ },
1382
+ "metadata": {},
1383
+ "execution_count": 59
1384
+ }
1385
+ ]
1386
+ },
1387
+ {
1388
+ "cell_type": "code",
1389
+ "source": [
1390
+ "#processor.tokenizer.padding_side = \"right\"\n",
1391
+ "inputs = processor.apply_chat_template(\n",
1392
+ " messages, add_generation_prompt=True, tokenize=True,\n",
1393
+ " return_dict=True, return_tensors=\"pt\"\n",
1394
+ ").to(model.device).to(model.dtype)"
1395
+ ],
1396
+ "metadata": {
1397
+ "id": "e4f0qr67lcjo"
1398
+ },
1399
+ "execution_count": null,
1400
+ "outputs": []
1401
+ },
1402
+ {
1403
+ "cell_type": "code",
1404
+ "source": [
1405
+ "inputs[\"input_ids\"].shape[-1]"
1406
+ ],
1407
+ "metadata": {
1408
+ "colab": {
1409
+ "base_uri": "https://localhost:8080/"
1410
+ },
1411
+ "id": "EOiBpgkI9kXi",
1412
+ "outputId": "911a6013-f76f-4fed-c402-8039d67b1e05"
1413
+ },
1414
+ "execution_count": null,
1415
+ "outputs": [
1416
+ {
1417
+ "output_type": "execute_result",
1418
+ "data": {
1419
+ "text/plain": [
1420
+ "2087"
1421
+ ]
1422
+ },
1423
+ "metadata": {},
1424
+ "execution_count": 61
1425
+ }
1426
+ ]
1427
+ },
1428
+ {
1429
+ "cell_type": "code",
1430
+ "source": [
1431
+ "with torch.inference_mode():\n",
1432
+ " generation = model.generate(**inputs, max_new_tokens=200, do_sample=False)"
1433
+ ],
1434
+ "metadata": {
1435
+ "id": "yJ95UXBqvXPM",
1436
+ "colab": {
1437
+ "base_uri": "https://localhost:8080/"
1438
+ },
1439
+ "outputId": "721839dc-aa78-401b-e802-b858690980da"
1440
+ },
1441
+ "execution_count": null,
1442
+ "outputs": [
1443
+ {
1444
+ "output_type": "stream",
1445
+ "name": "stderr",
1446
+ "text": [
1447
+ "The following generation flags are not valid and may be ignored: ['top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n"
1448
+ ]
1449
+ }
1450
+ ]
1451
+ },
1452
+ {
1453
+ "cell_type": "code",
1454
+ "source": [
1455
+ "input_len = inputs[\"input_ids\"].shape[-1]\n",
1456
+ "\n",
1457
+ "generation = generation[0][input_len:]\n",
1458
+ "\n",
1459
+ "decoded = processor.decode(generation, skip_special_tokens=True)\n",
1460
+ "print(decoded)"
1461
+ ],
1462
+ "metadata": {
1463
+ "colab": {
1464
+ "base_uri": "https://localhost:8080/"
1465
+ },
1466
+ "id": "3ifVZy9c74St",
1467
+ "outputId": "f8ab51c6-e5a3-4a16-875b-d07404041396"
1468
+ },
1469
+ "execution_count": null,
1470
+ "outputs": [
1471
+ {
1472
+ "output_type": "stream",
1473
+ "name": "stdout",
1474
+ "text": [
1475
+ "Here's a summary of what's happening in the video:\n",
1476
+ "\n",
1477
+ "The video appears to be taken at a ski resort. The main subject is a person snowboarding down a snowy slope. \n",
1478
+ "\n",
1479
+ "**Initial Scene (0.0 - 1.03):** The snowboarder is initially positioned on the slope, seemingly having fallen or stopped. Other skiers and snowboarders are visible in the background, waiting at what looks like a lift station.\n",
1480
+ "\n",
1481
+ "**Mid-Video (1.03 - 6.26):** The snowboarder gets back up and continues down the slope. They navigate past other people, including skiers and snowboarders, and eventually reach a lift station. The video shows the snowboarder interacting with others at the lift, possibly waiting for the lift to start or having just gotten off. There are also other skiers and snowboarders around the lift station.\n",
1482
+ "\n",
1483
+ "**End Scene (6.26):** The snowboarder is still at the lift station,\n"
1484
+ ]
1485
+ }
1486
+ ]
1487
+ }
1488
+ ]
1489
+ }
Idefics_FT.ipynb ADDED
@@ -0,0 +1,1866 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {
6
+ "id": "nc0g2NLpUSGr"
7
+ },
8
+ "source": [
9
+ "# Fine-tune IDEFICS3 on Visual Question Answering\n",
10
+ "\n",
11
+ "In this notebook we will fine-tune IDEFICS3 on VQAv2 dataset.\n",
12
+ "\n",
13
+ "The transformers PR isn't merged yet so we will install the branch that contains the transformers implementation"
14
+ ]
15
+ },
16
+ {
17
+ "cell_type": "code",
18
+ "execution_count": null,
19
+ "metadata": {
20
+ "colab": {
21
+ "base_uri": "https://localhost:8080/"
22
+ },
23
+ "id": "qttWxowEhlRt",
24
+ "outputId": "ca8d1fd2-ed88-4aef-f8a4-8f60df269b70"
25
+ },
26
+ "outputs": [],
27
+ "source": [
28
+ "!git clone https://github.com/andimarafioti/transformers.git"
29
+ ]
30
+ },
31
+ {
32
+ "cell_type": "code",
33
+ "execution_count": 2,
34
+ "metadata": {
35
+ "colab": {
36
+ "base_uri": "https://localhost:8080/"
37
+ },
38
+ "id": "qttWxowEhlRt",
39
+ "outputId": "ca8d1fd2-ed88-4aef-f8a4-8f60df269b70"
40
+ },
41
+ "outputs": [
42
+ {
43
+ "name": "stdout",
44
+ "output_type": "stream",
45
+ "text": [
46
+ "/home/merve/transformers\n"
47
+ ]
48
+ },
49
+ {
50
+ "name": "stderr",
51
+ "output_type": "stream",
52
+ "text": [
53
+ "/home/merve/anaconda3/envs/py311_env/lib/python3.11/site-packages/IPython/core/magics/osm.py:417: UserWarning: using dhist requires you to install the `pickleshare` library.\n",
54
+ " self.shell.db['dhist'] = compress_dhist(dhist)[-100:]\n"
55
+ ]
56
+ }
57
+ ],
58
+ "source": [
59
+ "%cd transformers"
60
+ ]
61
+ },
62
+ {
63
+ "cell_type": "code",
64
+ "execution_count": 3,
65
+ "metadata": {
66
+ "colab": {
67
+ "base_uri": "https://localhost:8080/"
68
+ },
69
+ "id": "qttWxowEhlRt",
70
+ "outputId": "ca8d1fd2-ed88-4aef-f8a4-8f60df269b70"
71
+ },
72
+ "outputs": [
73
+ {
74
+ "name": "stdout",
75
+ "output_type": "stream",
76
+ "text": [
77
+ "Previous HEAD position was a72b30fe0 hot fix for merve\n",
78
+ "Switched to branch 'idefics3'\n",
79
+ "Your branch is up to date with 'origin/idefics3'.\n"
80
+ ]
81
+ }
82
+ ],
83
+ "source": [
84
+ "!git checkout idefics3"
85
+ ]
86
+ },
87
+ {
88
+ "cell_type": "code",
89
+ "execution_count": 10,
90
+ "metadata": {
91
+ "colab": {
92
+ "base_uri": "https://localhost:8080/"
93
+ },
94
+ "id": "qttWxowEhlRt",
95
+ "outputId": "ca8d1fd2-ed88-4aef-f8a4-8f60df269b70"
96
+ },
97
+ "outputs": [
98
+ {
99
+ "name": "stdout",
100
+ "output_type": "stream",
101
+ "text": [
102
+ "Note: switching to 'a72b30fe06bba77d9df4c72fcea48bbdc0d812a5'.\n",
103
+ "\n",
104
+ "You are in 'detached HEAD' state. You can look around, make experimental\n",
105
+ "changes and commit them, and you can discard any commits you make in this\n",
106
+ "state without impacting any branches by switching back to a branch.\n",
107
+ "\n",
108
+ "If you want to create a new branch to retain commits you create, you may\n",
109
+ "do so (now or later) by using -c with the switch command. Example:\n",
110
+ "\n",
111
+ " git switch -c <new-branch-name>\n",
112
+ "\n",
113
+ "Or undo this operation with:\n",
114
+ "\n",
115
+ " git switch -\n",
116
+ "\n",
117
+ "Turn off this advice by setting config variable advice.detachedHead to false\n",
118
+ "\n",
119
+ "HEAD is now at a72b30fe0 hot fix for merve\n"
120
+ ]
121
+ }
122
+ ],
123
+ "source": [
124
+ "!git checkout a72b30fe06bba77d9df4c72fcea48bbdc0d812a5"
125
+ ]
126
+ },
127
+ {
128
+ "cell_type": "code",
129
+ "execution_count": null,
130
+ "metadata": {
131
+ "colab": {
132
+ "base_uri": "https://localhost:8080/"
133
+ },
134
+ "id": "qttWxowEhlRt",
135
+ "outputId": "ca8d1fd2-ed88-4aef-f8a4-8f60df269b70"
136
+ },
137
+ "outputs": [],
138
+ "source": [
139
+ "!pip install -q \".\""
140
+ ]
141
+ },
142
+ {
143
+ "cell_type": "code",
144
+ "execution_count": 12,
145
+ "metadata": {
146
+ "colab": {
147
+ "base_uri": "https://localhost:8080/"
148
+ },
149
+ "id": "WIhA1lQ7j0kw",
150
+ "outputId": "75d422a4-e258-455d-9b48-9fba36c060c3"
151
+ },
152
+ "outputs": [],
153
+ "source": [
154
+ "!pip install -q accelerate datasets peft bitsandbytes"
155
+ ]
156
+ },
157
+ {
158
+ "cell_type": "code",
159
+ "execution_count": 13,
160
+ "metadata": {
161
+ "colab": {
162
+ "base_uri": "https://localhost:8080/"
163
+ },
164
+ "id": "WIhA1lQ7j0kw",
165
+ "outputId": "75d422a4-e258-455d-9b48-9fba36c060c3"
166
+ },
167
+ "outputs": [],
168
+ "source": [
169
+ "!pip install -q flash-attn --no-build-isolation"
170
+ ]
171
+ },
172
+ {
173
+ "cell_type": "markdown",
174
+ "metadata": {
175
+ "id": "wAeMA0heVBjT"
176
+ },
177
+ "source": [
178
+ "We will push out model to Hub so we need to authenticate ourselves."
179
+ ]
180
+ },
181
+ {
182
+ "cell_type": "code",
183
+ "execution_count": null,
184
+ "metadata": {
185
+ "colab": {
186
+ "base_uri": "https://localhost:8080/",
187
+ "height": 145,
188
+ "referenced_widgets": [
189
+ "02f82c31a81c4f60a53644ac17e35ffd",
190
+ "2a08bc280647423188a7da9a87693167",
191
+ "0d0713e8a8624ac8bf79830c9553ff32",
192
+ "e01c3514a6904c79b5646688a515ca10",
193
+ "02adc6bce181453d9d18aea4fb1110be",
194
+ "224ed6bcd8c04e6fab9ef6c145630e39",
195
+ "e1063035ef1e42768dd984653d992137",
196
+ "bc34088935944cc8b02b2386239a3639",
197
+ "0d62633c4df246abb3d72f8c87d9cfb9",
198
+ "5bbba35b612247ea946b8844807dbb42",
199
+ "45242b70a62b4ebfbe4aac00c904bcc8",
200
+ "893b6342058945448a3861eb2c1c3a41",
201
+ "f8af58e353b94164b34d2ce064252dc2",
202
+ "45da78de95c4464d9eb60709ff94cc1a",
203
+ "4573bad8837142d0b1f063d568a771c6",
204
+ "cabd977f2993428d91fc75df5a15328e",
205
+ "bf91e9029f394c35874b4d35d61dd2c8",
206
+ "010cc98b3522423d86f89140fd7e1222",
207
+ "9d5a7a4379ce4e3493e7e050bfb173dc",
208
+ "4988f3cbc5164c499598c83a5b3a665b",
209
+ "b6ca0bfe87874730907ef1a4c500863a",
210
+ "8110e462f10b413e8dc59171fb84a13a",
211
+ "526fda6c78374906b7c1b93e5f973b25",
212
+ "d00821b88efa4256b29d52fe816a7c89",
213
+ "142c966c31fe4e5f99031da317e2ff54",
214
+ "d67b40ceeba8412f91ac885cb816eb01",
215
+ "a43226fe11eb4ec28c9619d7ee3a4618",
216
+ "e7878cd9245b4e56b172e40008df453b",
217
+ "5d09e1657d3e405f98d1a948c5c0c022",
218
+ "3429b2b924484bd2a45dfb6f186db6bc",
219
+ "0424a259c1d34333951b757c3c705b6f",
220
+ "89b4a59adc4942b290a9f3158b89423f"
221
+ ]
222
+ },
223
+ "id": "yKd5xtSGj7cm",
224
+ "outputId": "ca0d369a-8e70-46a6-bb77-50ed3509ab39"
225
+ },
226
+ "outputs": [],
227
+ "source": [
228
+ "from huggingface_hub import notebook_login\n",
229
+ "\n",
230
+ "notebook_login()"
231
+ ]
232
+ },
233
+ {
234
+ "cell_type": "markdown",
235
+ "metadata": {
236
+ "id": "WRq8ve-LVAzU"
237
+ },
238
+ "source": [
239
+ "In this notebook we will not do full fine-tuning but use QLoRA method, which loads an adapter to the quantized version of the model, saving space. If you want to do full fine-tuning, set `USE_LORA` and `USE_QLORA` to False. If you want to do LoRA, set `USE_QLORA` to False and `USE_LORA` to True."
240
+ ]
241
+ },
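To get a rough sense of why quantization helps here, a back-of-the-envelope estimate of weight memory for an 8B-parameter model (weights only; activations, gradients, and optimizer states are ignored, so real usage is higher):

```python
# Back-of-the-envelope weight memory for an 8B-parameter model.
# Weights only: activations, gradients, and optimizer states are ignored.
params = 8e9

bf16_gb = params * 2 / 1e9   # bf16 = 2 bytes per parameter
nf4_gb = params * 0.5 / 1e9  # NF4 = 4 bits = 0.5 bytes per parameter

print(f"bf16 weights: ~{bf16_gb:.0f} GB, NF4 weights: ~{nf4_gb:.0f} GB")
# bf16 weights: ~16 GB, NF4 weights: ~4 GB
```

This is why the 4-bit QLoRA setup fits comfortably on a single GPU where full bf16 fine-tuning would not.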
242
+ {
243
+ "cell_type": "code",
244
+ "execution_count": 1,
245
+ "metadata": {},
246
+ "outputs": [],
247
+ "source": [
248
+ "import os\n",
249
+ "os.environ[\"CUDA_DEVICE_ORDER\"] = \"PCI_BUS_ID\"\n",
250
+ "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"4\" # you don't need this unless you work on a multigpu setup and need to use a specific index\n",
251
+ "# if you want to use multiple GPUs, use e.g. \"2,4\""
252
+ ]
253
+ },
264
+ {
265
+ "cell_type": "markdown",
266
+ "metadata": {
267
+ "id": "QtjggkcTVnSV"
268
+ },
269
+ "source": [
270
+ "We will load VQAv2 dataset. For educational purposes we will load the validation split and split it twice."
271
+ ]
272
+ },
273
+ {
274
+ "cell_type": "code",
275
+ "execution_count": 9,
276
+ "metadata": {
277
+ "colab": {
278
+ "base_uri": "https://localhost:8080/"
279
+ },
280
+ "id": "POOqKqYRka5O",
281
+ "outputId": "87977922-2c3a-4c96-fffb-7c097b0815fa"
282
+ },
283
+ "outputs": [],
284
+ "source": [
285
+ "from datasets import load_dataset\n",
286
+ "ds = load_dataset('merve/vqav2-small', trust_remote_code=True)"
287
+ ]
288
+ },
289
+ {
290
+ "cell_type": "code",
291
+ "execution_count": 10,
292
+ "metadata": {
293
+ "id": "Znf9vMo5rnSd"
294
+ },
295
+ "outputs": [],
296
+ "source": [
297
+ "split_ds = ds[\"validation\"].train_test_split(test_size=0.8)\n",
298
+ "train_ds = split_ds[\"train\"]"
299
+ ]
300
+ },
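With `test_size=0.8`, `train_test_split` keeps only 20% of the validation examples for training. The arithmetic, assuming the split holds ~21,435 examples (a figure consistent with the 4287-row train set shown in the next cell):

```python
# Fraction kept for training when test_size=0.8
total_examples = 21435  # assumption: size of the vqav2-small validation split
test_size = 0.8
train_examples = round(total_examples * (1 - test_size))
print(train_examples)  # 4287
```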
301
+ {
302
+ "cell_type": "code",
303
+ "execution_count": 11,
304
+ "metadata": {},
305
+ "outputs": [
306
+ {
307
+ "data": {
308
+ "text/plain": [
309
+ "Dataset({\n",
310
+ " features: ['multiple_choice_answer', 'question', 'image'],\n",
311
+ " num_rows: 4287\n",
312
+ "})"
313
+ ]
314
+ },
315
+ "execution_count": 11,
316
+ "metadata": {},
317
+ "output_type": "execute_result"
318
+ }
319
+ ],
320
+ "source": [
321
+ "train_ds"
322
+ ]
323
+ },
324
+ {
325
+ "cell_type": "code",
326
+ "execution_count": null,
327
+ "metadata": {},
328
+ "outputs": [],
329
+ "source": [
330
+ "import torch\n",
331
+ "from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model\n",
332
+ "from transformers import AutoProcessor, BitsAndBytesConfig, Idefics3ForConditionalGeneration\n",
333
+ "\n",
334
+ "USE_LORA = False\n",
335
+ "USE_QLORA = False\n",
336
+ "model_id = \"HuggingFaceM4/Idefics3-8B-Llama3\"\n",
337
+ "\n",
338
+ "processor = AutoProcessor.from_pretrained(\n",
339
+ " model_id\n",
340
+ ")\n",
341
+ "\n",
342
+ "if USE_QLORA or USE_LORA:\n",
343
+ " lora_config = LoraConfig(\n",
344
+ " r=8,\n",
345
+ " lora_alpha=8,\n",
346
+ " lora_dropout=0.1,\n",
347
+ " target_modules=['down_proj','o_proj','k_proj','q_proj','gate_proj','up_proj','v_proj'],\n",
348
+ " use_dora=False if USE_QLORA else True,\n",
349
+ " init_lora_weights=\"gaussian\"\n",
350
+ " )\n",
351
+ " lora_config.inference_mode = False\n",
352
+ " if USE_QLORA:\n",
353
+ " bnb_config = BitsAndBytesConfig(\n",
354
+ " load_in_4bit=True,\n",
355
+ " bnb_4bit_use_double_quant=True,\n",
356
+ " bnb_4bit_quant_type=\"nf4\",\n",
357
+ " bnb_4bit_compute_dtype=torch.bfloat16\n",
358
+ " )\n",
359
+ " \n",
360
+ " model = Idefics3ForConditionalGeneration.from_pretrained(\n",
361
+ " model_id,\n",
362
+ " quantization_config=bnb_config if USE_QLORA else None,\n",
363
+ " _attn_implementation=\"flash_attention_2\",\n",
364
+ " device_map=\"auto\"\n",
365
+ " )\n",
366
+ " model.add_adapter(lora_config)\n",
367
+ " model.enable_adapters()\n",
368
+ " model = prepare_model_for_kbit_training(model)\n",
369
+ " model = get_peft_model(model, lora_config)\n",
370
+ " print(model.get_nb_trainable_parameters())\n",
371
+ "else:\n",
372
+ " model = Idefics3ForConditionalGeneration.from_pretrained(\n",
373
+ " model_id,\n",
374
+ " torch_dtype=torch.bfloat16,\n",
375
+ " _attn_implementation=\"flash_attention_2\",\n",
376
+ " ).to(DEVICE)\n",
377
+ " \n",
378
+ " # if you'd like to only fine-tune LLM\n",
379
+ " for param in model.model.vision_model.parameters():\n",
380
+ " param.requires_grad = False"
381
+ ]
382
+ },
383
+ {
384
+ "cell_type": "markdown",
385
+ "metadata": {
386
+ "id": "5nwMO3n0X7Hv"
387
+ },
388
+ "source": [
389
+ "Let's write our data collating function. We will apply prompt template to have questions and answers together so model can learn to answer. Then we pass the formatted prompts and images to the processor which processes both."
390
+ ]
391
+ },
392
+ {
393
+ "cell_type": "code",
394
+ "execution_count": 12,
395
+ "metadata": {
396
+ "id": "e0krVLZ-wNMl"
397
+ },
398
+ "outputs": [],
399
+ "source": [
400
+ "image_token_id = processor.tokenizer.additional_special_tokens_ids[\n",
401
+ " processor.tokenizer.additional_special_tokens.index(\"<image>\")]\n",
402
+ "\n",
403
+ "def collate_fn(examples):\n",
404
+ " texts = []\n",
405
+ " images = []\n",
406
+ " for example in examples:\n",
407
+ " image = example[\"image\"]\n",
408
+ " question = example[\"question\"]\n",
409
+ " answer = example[\"multiple_choice_answer\"]\n",
410
+ " messages = [\n",
411
+ " {\n",
412
+ " \"role\": \"user\",\n",
413
+ " \"content\": [\n",
414
+ " {\"type\": \"text\", \"text\": \"Answer briefly.\"},\n",
415
+ " {\"type\": \"image\"},\n",
416
+ " {\"type\": \"text\", \"text\": question}\n",
417
+ " ]\n",
418
+ " },\n",
419
+ " {\n",
420
+ " \"role\": \"assistant\",\n",
421
+ " \"content\": [\n",
422
+ " {\"type\": \"text\", \"text\": answer}\n",
423
+ " ]\n",
424
+ " }\n",
425
+ " ]\n",
426
+ " text = processor.apply_chat_template(messages, add_generation_prompt=False)\n",
427
+ " texts.append(text.strip())\n",
428
+ " images.append([image])\n",
429
+ "\n",
430
+ " batch = processor(text=texts, images=images, return_tensors=\"pt\", padding=True)\n",
431
+ " labels = batch[\"input_ids\"].clone()\n",
432
+ " labels[labels == processor.tokenizer.pad_token_id] = -100\n",
433
+ " labels[labels == image_token_id] = -100 \n",
434
+ " batch[\"labels\"] = labels\n",
435
+ "\n",
436
+ " return batch\n"
437
+ ]
438
+ },
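The masking step in the collator above can be illustrated in isolation: both padding tokens and image placeholder tokens are set to -100 in the labels, the index that cross-entropy loss ignores, so the model is only trained on real text tokens. A minimal sketch with made-up token IDs (the real IDs come from the tokenizer):

```python
# Made-up token IDs for illustration only; real IDs come from the tokenizer.
PAD_TOKEN_ID = 128002
IMAGE_TOKEN_ID = 128257
IGNORE_INDEX = -100  # index ignored by cross-entropy loss

def mask_labels(input_ids):
    """Copy input_ids into labels, masking pad and image tokens."""
    return [
        IGNORE_INDEX if tok in (PAD_TOKEN_ID, IMAGE_TOKEN_ID) else tok
        for tok in input_ids
    ]

print(mask_labels([42, 128257, 7, 128002, 128002]))
# [42, -100, 7, -100, -100]
```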
439
+ {
440
+ "cell_type": "markdown",
441
+ "metadata": {
442
+ "id": "QvAs896cdwg8"
443
+ },
444
+ "source": [
445
+ "We can now initialize `Trainer` and initialize `TrainingArguments` to pass to `Trainer`."
446
+ ]
447
+ },
448
+ {
449
+ "cell_type": "code",
450
+ "execution_count": 14,
451
+ "metadata": {
452
+ "colab": {
453
+ "base_uri": "https://localhost:8080/"
454
+ },
455
+ "id": "QNE2yWAYrAhD",
456
+ "outputId": "2bdefa08-a54b-40e0-cae8-f029ff8312e7"
457
+ },
458
+ "outputs": [],
459
+ "source": [
460
+ "from transformers import TrainingArguments, Trainer\n",
461
+ "\n",
462
+ "training_args = TrainingArguments(\n",
463
+ " num_train_epochs=1,\n",
464
+ " per_device_train_batch_size=2,\n",
465
+ " gradient_accumulation_steps=8,\n",
466
+ " warmup_steps=50,\n",
467
+ " learning_rate=1e-4,\n",
468
+ " weight_decay=0.01,\n",
469
+ " logging_steps=25,\n",
470
+ " save_strategy=\"steps\",\n",
471
+ " save_steps=250,\n",
472
+ " save_total_limit=1,\n",
473
+ " optim=\"adamw_hf\", # for 8-bit, pick paged_adamw_hf\n",
474
+ " #evaluation_strategy=\"epoch\",\n",
475
+ " bf16=True,\n",
476
+ " output_dir=\"./idefics3-llama-vqav2\",\n",
477
+ " hub_model_id=\"idefics3-llama-vqav2\",\n",
478
+ " remove_unused_columns=False,\n",
479
+ ")\n"
480
+ ]
481
+ },
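Note that with `per_device_train_batch_size=2` and `gradient_accumulation_steps=8`, gradients are accumulated over 8 forward passes before each optimizer step, giving an effective batch size of 16 per device (times the number of devices if you train on multiple GPUs):

```python
# Effective batch size implied by the TrainingArguments above.
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
num_devices = 1  # assumption: a single GPU, as in this notebook

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
print(effective_batch_size)  # 16
```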
482
+ {
483
+ "cell_type": "code",
484
+ "execution_count": 15,
485
+ "metadata": {
486
+ "id": "oBBSDpBhreJd"
487
+ },
488
+ "outputs": [
489
+ {
490
+ "name": "stderr",
491
+ "output_type": "stream",
492
+ "text": [
493
+ "Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.\n"
494
+ ]
495
+ }
496
+ ],
497
+ "source": [
498
+ "trainer = Trainer(\n",
499
+ " model=model,\n",
500
+ " args=training_args,\n",
501
+ " data_collator=collate_fn,\n",
502
+ " train_dataset=train_ds,\n",
503
+ " #eval_dataset=test_ds,\n",
504
+ ")"
505
+ ]
506
+ },
507
+ {
508
+ "cell_type": "markdown",
509
+ "metadata": {},
510
+ "source": [
511
+ "I'm running standalone scripts on top of tmux so the logs will not appear here. I will upload my training script to this repository."
512
+ ]
513
+ },
514
+ {
515
+ "cell_type": "code",
516
+ "execution_count": null,
517
+ "metadata": {},
518
+ "outputs": [],
519
+ "source": [
520
+ "trainer.train()"
521
+ ]
522
+ },
523
+ {
524
+ "cell_type": "code",
525
+ "execution_count": null,
526
+ "metadata": {},
527
+ "outputs": [],
528
+ "source": [
529
+ "trainer.push_to_hub()"
530
+ ]
531
+ }
532
+ ],
533
+ "metadata": {
534
+ "accelerator": "GPU",
535
+ "colab": {
536
+ "gpuType": "A100",
537
+ "machine_shape": "hm",
538
+ "provenance": []
539
+ },
540
+ "kernelspec": {
541
+ "display_name": "py311_env",
542
+ "language": "python",
543
+ "name": "py311_env"
544
+ },
545
+ "language_info": {
546
+ "codemirror_mode": {
547
+ "name": "ipython",
548
+ "version": 3
549
+ },
550
+ "file_extension": ".py",
551
+ "mimetype": "text/x-python",
552
+ "name": "python",
553
+ "nbconvert_exporter": "python",
554
+ "pygments_lexer": "ipython3",
555
+ "version": "3.11.9"
556
+ },
557
+ "widgets": {
558
+ "application/vnd.jupyter.widget-state+json": {
559
+ "010cc98b3522423d86f89140fd7e1222": {
560
+ "model_module": "@jupyter-widgets/controls",
561
+ "model_module_version": "1.5.0",
562
+ "model_name": "LabelModel",
563
+ "state": {
564
+ "_dom_classes": [],
565
+ "_model_module": "@jupyter-widgets/controls",
566
+ "_model_module_version": "1.5.0",
567
+ "_model_name": "LabelModel",
568
+ "_view_count": null,
569
+ "_view_module": "@jupyter-widgets/controls",
570
+ "_view_module_version": "1.5.0",
571
+ "_view_name": "LabelView",
572
+ "description": "",
573
+ "description_tooltip": null,
574
+ "layout": "IPY_MODEL_9d5a7a4379ce4e3493e7e050bfb173dc",
575
+ "placeholder": "​",
576
+ "style": "IPY_MODEL_4988f3cbc5164c499598c83a5b3a665b",
577
+ "value": "Connecting..."
578
+ }
579
+ },
580
+ "02adc6bce181453d9d18aea4fb1110be": {
581
+ "model_module": "@jupyter-widgets/controls",
582
+ "model_module_version": "1.5.0",
583
+ "model_name": "ButtonModel",
584
+ "state": {
585
+ "_dom_classes": [],
586
+ "_model_module": "@jupyter-widgets/controls",
587
+ "_model_module_version": "1.5.0",
588
+ "_model_name": "ButtonModel",
589
+ "_view_count": null,
590
+ "_view_module": "@jupyter-widgets/controls",
591
+ "_view_module_version": "1.5.0",
592
+ "_view_name": "ButtonView",
593
+ "button_style": "",
594
+ "description": "Login",
595
+ "disabled": false,
596
+ "icon": "",
597
+ "layout": "IPY_MODEL_45da78de95c4464d9eb60709ff94cc1a",
598
+ "style": "IPY_MODEL_4573bad8837142d0b1f063d568a771c6",
599
+ "tooltip": ""
600
+ }
601
+ },
602
+ "02f82c31a81c4f60a53644ac17e35ffd": {
603
+ "model_module": "@jupyter-widgets/controls",
604
+ "model_module_version": "1.5.0",
605
+ "model_name": "VBoxModel",
606
+ "state": {
607
+ "_dom_classes": [],
608
+ "_model_module": "@jupyter-widgets/controls",
609
+ "_model_module_version": "1.5.0",
610
+ "_model_name": "VBoxModel",
611
+ "_view_count": null,
612
+ "_view_module": "@jupyter-widgets/controls",
613
+ "_view_module_version": "1.5.0",
614
+ "_view_name": "VBoxView",
615
+ "box_style": "",
616
+ "children": [
617
+ "IPY_MODEL_b6ca0bfe87874730907ef1a4c500863a",
618
+ "IPY_MODEL_8110e462f10b413e8dc59171fb84a13a",
619
+ "IPY_MODEL_526fda6c78374906b7c1b93e5f973b25",
620
+ "IPY_MODEL_d00821b88efa4256b29d52fe816a7c89"
621
+ ],
622
+ "layout": "IPY_MODEL_e1063035ef1e42768dd984653d992137"
623
+ }
624
+ },
625
+ "0424a259c1d34333951b757c3c705b6f": {
626
+ "model_module": "@jupyter-widgets/base",
627
+ "model_module_version": "1.2.0",
628
+ "model_name": "LayoutModel",
629
+ "state": {
630
+ "_model_module": "@jupyter-widgets/base",
631
+ "_model_module_version": "1.2.0",
632
+ "_model_name": "LayoutModel",
633
+ "_view_count": null,
634
+ "_view_module": "@jupyter-widgets/base",
635
+ "_view_module_version": "1.2.0",
636
+ "_view_name": "LayoutView",
637
+ "align_content": null,
638
+ "align_items": null,
639
+ "align_self": null,
640
+ "border": null,
641
+ "bottom": null,
642
+ "display": null,
643
+ "flex": null,
644
+ "flex_flow": null,
645
+ "grid_area": null,
646
+ "grid_auto_columns": null,
647
+ "grid_auto_flow": null,
648
+ "grid_auto_rows": null,
649
+ "grid_column": null,
650
+ "grid_gap": null,
651
+ "grid_row": null,
652
+ "grid_template_areas": null,
653
+ "grid_template_columns": null,
654
+ "grid_template_rows": null,
655
+ "height": null,
656
+ "justify_content": null,
657
+ "justify_items": null,
658
+ "left": null,
659
+ "margin": null,
660
+ "max_height": null,
661
+ "max_width": null,
662
+ "min_height": null,
663
+ "min_width": null,
664
+ "object_fit": null,
665
+ "object_position": null,
666
+ "order": null,
667
+ "overflow": null,
668
+ "overflow_x": null,
669
+ "overflow_y": null,
670
+ "padding": null,
671
+ "right": null,
672
+ "top": null,
673
+ "visibility": null,
674
+ "width": null
675
+ }
676
+ },
677
+ "0d0713e8a8624ac8bf79830c9553ff32": {
678
+ "model_module": "@jupyter-widgets/controls",
679
+ "model_module_version": "1.5.0",
680
+ "model_name": "PasswordModel",
681
+ "state": {
682
+ "_dom_classes": [],
683
+ "_model_module": "@jupyter-widgets/controls",
684
+ "_model_module_version": "1.5.0",
685
+ "_model_name": "PasswordModel",
686
+ "_view_count": null,
687
+ "_view_module": "@jupyter-widgets/controls",
688
+ "_view_module_version": "1.5.0",
689
+ "_view_name": "PasswordView",
690
+ "continuous_update": true,
691
+ "description": "Token:",
692
+ "description_tooltip": null,
693
+ "disabled": false,
694
+ "layout": "IPY_MODEL_5bbba35b612247ea946b8844807dbb42",
695
+ "placeholder": "​",
696
+ "style": "IPY_MODEL_45242b70a62b4ebfbe4aac00c904bcc8",
697
+ "value": ""
698
+ }
699
+ },
700
+ "0d62633c4df246abb3d72f8c87d9cfb9": {
701
+ "model_module": "@jupyter-widgets/controls",
702
+ "model_module_version": "1.5.0",
703
+ "model_name": "DescriptionStyleModel",
704
+ "state": {
705
+ "_model_module": "@jupyter-widgets/controls",
706
+ "_model_module_version": "1.5.0",
707
+ "_model_name": "DescriptionStyleModel",
708
+ "_view_count": null,
709
+ "_view_module": "@jupyter-widgets/base",
710
+ "_view_module_version": "1.2.0",
711
+ "_view_name": "StyleView",
712
+ "description_width": ""
713
+ }
714
+ },
715
+ "142c966c31fe4e5f99031da317e2ff54": {
716
+ "model_module": "@jupyter-widgets/base",
717
+ "model_module_version": "1.2.0",
718
+ "model_name": "LayoutModel",
719
+ "state": {
720
+ "_model_module": "@jupyter-widgets/base",
721
+ "_model_module_version": "1.2.0",
722
+ "_model_name": "LayoutModel",
723
+ "_view_count": null,
724
+ "_view_module": "@jupyter-widgets/base",
725
+ "_view_module_version": "1.2.0",
726
+ "_view_name": "LayoutView",
727
+ "align_content": null,
728
+ "align_items": null,
729
+ "align_self": null,
730
+ "border": null,
731
+ "bottom": null,
732
+ "display": null,
733
+ "flex": null,
734
+ "flex_flow": null,
735
+ "grid_area": null,
736
+ "grid_auto_columns": null,
737
+ "grid_auto_flow": null,
738
+ "grid_auto_rows": null,
739
+ "grid_column": null,
740
+ "grid_gap": null,
741
+ "grid_row": null,
742
+ "grid_template_areas": null,
743
+ "grid_template_columns": null,
744
+ "grid_template_rows": null,
745
+ "height": null,
746
+ "justify_content": null,
747
+ "justify_items": null,
748
+ "left": null,
749
+ "margin": null,
750
+ "max_height": null,
751
+ "max_width": null,
752
+ "min_height": null,
753
+ "min_width": null,
754
+ "object_fit": null,
755
+ "object_position": null,
756
+ "order": null,
757
+ "overflow": null,
758
+ "overflow_x": null,
759
+ "overflow_y": null,
760
+ "padding": null,
761
+ "right": null,
762
+ "top": null,
763
+ "visibility": null,
764
+ "width": null
765
+ }
766
+ },
767
+ "224ed6bcd8c04e6fab9ef6c145630e39": {
768
+ "model_module": "@jupyter-widgets/controls",
769
+ "model_module_version": "1.5.0",
770
+ "model_name": "HTMLModel",
771
+ "state": {
772
+ "_dom_classes": [],
773
+ "_model_module": "@jupyter-widgets/controls",
774
+ "_model_module_version": "1.5.0",
775
+ "_model_name": "HTMLModel",
776
+ "_view_count": null,
777
+ "_view_module": "@jupyter-widgets/controls",
778
+ "_view_module_version": "1.5.0",
779
+ "_view_name": "HTMLView",
780
+ "description": "",
781
+ "description_tooltip": null,
782
+ "layout": "IPY_MODEL_cabd977f2993428d91fc75df5a15328e",
783
+ "placeholder": "​",
784
+ "style": "IPY_MODEL_bf91e9029f394c35874b4d35d61dd2c8",
785
+ "value": "\n<b>Pro Tip:</b> If you don't already have one, you can create a dedicated\n'notebooks' token with 'write' access, that you can then easily reuse for all\nnotebooks. </center>"
786
+ }
787
+ },
788
+ "2641dbfe060e466bbe10038004edefc0": {
789
+ "model_module": "@jupyter-widgets/base",
790
+ "model_module_version": "1.2.0",
791
+ "model_name": "LayoutModel",
792
+ "state": {
793
+ "_model_module": "@jupyter-widgets/base",
794
+ "_model_module_version": "1.2.0",
795
+ "_model_name": "LayoutModel",
796
+ "_view_count": null,
797
+ "_view_module": "@jupyter-widgets/base",
798
+ "_view_module_version": "1.2.0",
799
+ "_view_name": "LayoutView",
800
+ "align_content": null,
801
+ "align_items": null,
802
+ "align_self": null,
803
+ "border": null,
804
+ "bottom": null,
805
+ "display": null,
806
+ "flex": null,
807
+ "flex_flow": null,
808
+ "grid_area": null,
809
+ "grid_auto_columns": null,
810
+ "grid_auto_flow": null,
811
+ "grid_auto_rows": null,
812
+ "grid_column": null,
813
+ "grid_gap": null,
814
+ "grid_row": null,
815
+ "grid_template_areas": null,
816
+ "grid_template_columns": null,
817
+ "grid_template_rows": null,
818
+ "height": null,
819
+ "justify_content": null,
820
+ "justify_items": null,
821
+ "left": null,
822
+ "margin": null,
823
+ "max_height": null,
824
+ "max_width": null,
825
+ "min_height": null,
826
+ "min_width": null,
827
+ "object_fit": null,
828
+ "object_position": null,
829
+ "order": null,
830
+ "overflow": null,
831
+ "overflow_x": null,
832
+ "overflow_y": null,
833
+ "padding": null,
834
+ "right": null,
835
+ "top": null,
836
+ "visibility": null,
837
+ "width": null
838
+ }
839
+ },
840
+ "2a08bc280647423188a7da9a87693167": {
841
+ "model_module": "@jupyter-widgets/controls",
842
+ "model_module_version": "1.5.0",
843
+ "model_name": "HTMLModel",
844
+ "state": {
845
+ "_dom_classes": [],
846
+ "_model_module": "@jupyter-widgets/controls",
847
+ "_model_module_version": "1.5.0",
848
+ "_model_name": "HTMLModel",
849
+ "_view_count": null,
850
+ "_view_module": "@jupyter-widgets/controls",
851
+ "_view_module_version": "1.5.0",
852
+ "_view_name": "HTMLView",
853
+ "description": "",
854
+ "description_tooltip": null,
855
+ "layout": "IPY_MODEL_bc34088935944cc8b02b2386239a3639",
856
+ "placeholder": "​",
857
+ "style": "IPY_MODEL_0d62633c4df246abb3d72f8c87d9cfb9",
858
+ "value": "<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.svg\nalt='Hugging Face'> <br> Copy a token from <a\nhref=\"https://huggingface.co/settings/tokens\" target=\"_blank\">your Hugging Face\ntokens page</a> and paste it below. <br> Immediately click login after copying\nyour token or it might be stored in plain text in this notebook file. </center>"
859
+ }
860
+ },
861
+ "3429b2b924484bd2a45dfb6f186db6bc": {
862
+ "model_module": "@jupyter-widgets/controls",
863
+ "model_module_version": "1.5.0",
864
+ "model_name": "DescriptionStyleModel",
865
+ "state": {
866
+ "_model_module": "@jupyter-widgets/controls",
867
+ "_model_module_version": "1.5.0",
868
+ "_model_name": "DescriptionStyleModel",
869
+ "_view_count": null,
870
+ "_view_module": "@jupyter-widgets/base",
871
+ "_view_module_version": "1.2.0",
872
+ "_view_name": "StyleView",
873
+ "description_width": ""
874
+ }
875
+ },
876
+ "377ff50b710f496ab9acb4554df58df2": {
877
+ "model_module": "@jupyter-widgets/controls",
878
+ "model_module_version": "1.5.0",
879
+ "model_name": "HTMLModel",
880
+ "state": {
881
+ "_dom_classes": [],
882
+ "_model_module": "@jupyter-widgets/controls",
883
+ "_model_module_version": "1.5.0",
884
+ "_model_name": "HTMLModel",
885
+ "_view_count": null,
886
+ "_view_module": "@jupyter-widgets/controls",
887
+ "_view_module_version": "1.5.0",
888
+ "_view_name": "HTMLView",
+ "description": "",
+ "description_tooltip": null,
+ "layout": "IPY_MODEL_b7b9ed6e0aa14e29bf27dbaf4e248f03",
+ "placeholder": "​",
+ "style": "IPY_MODEL_c7d2883b47e741b4a0af037342ce88b3",
+ "value": "Loading checkpoint shards: 100%"
+ }
+ },
+ … (remaining auto-generated Jupyter widget-state entries elided: layout/style models plus the progress and login widgets — "Loading checkpoint shards: 100%", "4/4 [00:11<00:00, 2.59s/it]", "Token is valid (permission: write).", "Your token has been saved in your configured git credential helpers (store).", "Your token has been saved to /root/.cache/huggingface/token", "Login successful") …
+ }
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+ }
LICENSE ADDED
@@ -0,0 +1,201 @@
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
88
+
89
+ 4. Redistribution. You may reproduce and distribute copies of the
90
+ Work or Derivative Works thereof in any medium, with or without
91
+ modifications, and in Source or Object form, provided that You
92
+ meet the following conditions:
93
+
94
+ (a) You must give any other recipients of the Work or
95
+ Derivative Works a copy of this License; and
96
+
97
+ (b) You must cause any modified files to carry prominent notices
98
+ stating that You changed the files; and
99
+
100
+ (c) You must retain, in the Source form of any Derivative Works
101
+ that You distribute, all copyright, patent, trademark, and
102
+ attribution notices from the Source form of the Work,
103
+ excluding those notices that do not pertain to any part of
104
+ the Derivative Works; and
105
+
106
+ (d) If the Work includes a "NOTICE" text file as part of its
107
+ distribution, then any Derivative Works that You distribute must
108
+ include a readable copy of the attribution notices contained
109
+ within such NOTICE file, excluding those notices that do not
110
+ pertain to any part of the Derivative Works, in at least one
111
+ of the following places: within a NOTICE text file distributed
112
+ as part of the Derivative Works; within the Source form or
113
+ documentation, if provided along with the Derivative Works; or,
114
+ within a display generated by the Derivative Works, if and
115
+ wherever such third-party notices normally appear. The contents
116
+ of the NOTICE file are for informational purposes only and
117
+ do not modify the License. You may add Your own attribution
118
+ notices within Derivative Works that You distribute, alongside
119
+ or as an addendum to the NOTICE text from the Work, provided
120
+ that such additional attribution notices cannot be construed
121
+ as modifying the License.
122
+
123
+ You may add Your own copyright statement to Your modifications and
124
+ may provide additional or different license terms and conditions
125
+ for use, reproduction, or distribution of Your modifications, or
126
+ for any such Derivative Works as a whole, provided Your use,
127
+ reproduction, and distribution of the Work otherwise complies with
128
+ the conditions stated in this License.
129
+
130
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
131
+ any Contribution intentionally submitted for inclusion in the Work
132
+ by You to the Licensor shall be under the terms and conditions of
133
+ this License, without any additional terms or conditions.
134
+ Notwithstanding the above, nothing herein shall supersede or modify
135
+ the terms of any separate license agreement you may have executed
136
+ with Licensor regarding such Contributions.
137
+
138
+ 6. Trademarks. This License does not grant permission to use the trade
139
+ names, trademarks, service marks, or product names of the Licensor,
140
+ except as required for reasonable and customary use in describing the
141
+ origin of the Work and reproducing the content of the NOTICE file.
142
+
143
+ 7. Disclaimer of Warranty. Unless required by applicable law or
144
+ agreed to in writing, Licensor provides the Work (and each
145
+ Contributor provides its Contributions) on an "AS IS" BASIS,
146
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147
+ implied, including, without limitation, any warranties or conditions
148
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149
+ PARTICULAR PURPOSE. You are solely responsible for determining the
150
+ appropriateness of using or redistributing the Work and assume any
151
+ risks associated with Your exercise of permissions under this License.
152
+
153
+ 8. Limitation of Liability. In no event and under no legal theory,
154
+ whether in tort (including negligence), contract, or otherwise,
155
+ unless required by applicable law (such as deliberate and grossly
156
+ negligent acts) or agreed to in writing, shall any Contributor be
157
+ liable to You for damages, including any direct, indirect, special,
158
+ incidental, or consequential damages of any character arising as a
159
+ result of this License or out of the use or inability to use the
160
+ Work (including but not limited to damages for loss of goodwill,
161
+ work stoppage, computer failure or malfunction, or any and all
162
+ other commercial damages or losses), even if such Contributor
163
+ has been advised of the possibility of such damages.
164
+
165
+ 9. Accepting Warranty or Additional Liability. While redistributing
166
+ the Work or Derivative Works thereof, You may choose to offer,
167
+ and charge a fee for, acceptance of support, warranty, indemnity,
168
+ or other liability obligations and/or rights consistent with this
169
+ License. However, in accepting such obligations, You may act only
170
+ on Your own behalf and on Your sole responsibility, not on behalf
171
+ of any other Contributor, and only if You agree to indemnify,
172
+ defend, and hold each Contributor harmless for any liability
173
+ incurred by, or claims asserted against, such Contributor by reason
174
+ of your accepting any such warranty or additional liability.
175
+
176
+ END OF TERMS AND CONDITIONS
177
+
178
+ APPENDIX: How to apply the Apache License to your work.
179
+
180
+ To apply the Apache License to your work, attach the following
181
+ boilerplate notice, with the fields enclosed by brackets "[]"
182
+ replaced with your own identifying information. (Don't include
183
+ the brackets!) The text should be enclosed in the appropriate
184
+ comment syntax for the file format. We also recommend that a
185
+ file or class name and description of purpose be included on the
186
+ same "printed page" as the copyright notice for easier
187
+ identification within third-party archives.
188
+
189
+ Copyright [yyyy] [name of copyright owner]
190
+
191
+ Licensed under the Apache License, Version 2.0 (the "License");
192
+ you may not use this file except in compliance with the License.
193
+ You may obtain a copy of the License at
194
+
195
+ http://www.apache.org/licenses/LICENSE-2.0
196
+
197
+ Unless required by applicable law or agreed to in writing, software
198
+ distributed under the License is distributed on an "AS IS" BASIS,
199
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200
+ See the License for the specific language governing permissions and
201
+ limitations under the License.
PaliGemma_DPO.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
README.md CHANGED
@@ -1,3 +1,26 @@
1
- ---
2
- license: apache-2.0
3
- ---
 
1
+ ![Smol](https://github.com/merveenoyan/smol-vision/assets/53175384/930d5b36-bb9d-4ab6-8b5a-4fec28c48f80)
2
+ # Smol Vision 🐣
3
+ Recipes for shrinking, optimizing, and customizing cutting-edge vision and multimodal AI models.
4
+
5
+ Latest examples 👇🏻
6
+ - [Fine-tuning SmolVLM2 on Video Captioning](https://github.com/merveenoyan/smol-vision/blob/main/Fine_tune_SmolVLM2_on_Video.ipynb)
7
+ - [Multimodal RAG using ColPali and Qwen2-VL](https://github.com/merveenoyan/smol-vision/blob/main/ColPali_%2B_Qwen2_VL.ipynb)
8
+ - [Fine-tune ColPali for Multimodal RAG](https://github.com/merveenoyan/smol-vision/blob/main/Finetune_ColPali.ipynb)
9
+
10
+ **Note**: The script and notebook are updated to fix a few issues related to QLoRA!
11
+
12
+ | | Notebook | Description |
13
+ |------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|
14
+ | Quantization/ONNX | [Faster and Smaller Zero-shot Object Detection with Optimum](https://github.com/merveenoyan/smol-vision/blob/main/Faster_Zero_shot_Object_Detection_with_Optimum.ipynb) | Quantize the state-of-the-art zero-shot object detection model OWLv2 using Optimum ONNXRuntime tools. |
15
+ | VLM Fine-tuning | [Fine-tune PaliGemma](https://github.com/merveenoyan/smol-vision/blob/main/Fine_tune_PaliGemma.ipynb) | Fine-tune state-of-the-art vision language backbone PaliGemma using transformers. |
16
+ | Intro to Optimum/ORT | [Optimizing DETR with 🤗 Optimum](https://github.com/merveenoyan/smol-vision/blob/main/Reduce_any_model_to_fp16_using_%F0%9F%A4%97_Optimum_DETR.ipynb) | A soft introduction to exporting vision models to ONNX and quantizing them. |
17
+ | Model Shrinking | [Knowledge Distillation for Computer Vision](https://huggingface.co/docs/transformers/en/tasks/knowledge_distillation_for_image_classification) | Knowledge distillation for image classification. |
18
+ | Quantization | [Fit in vision models using Quanto](https://github.com/merveenoyan/smol-vision/blob/main/Fit_in_vision_models_using_quanto.ipynb) | Fit vision models into smaller hardware using quanto |
19
+ | Speed-up | [Faster foundation models with torch.compile](https://github.com/merveenoyan/smol-vision/blob/main/Faster_foundation_models_with_torch_compile.ipynb) | Improving latency for foundation models using `torch.compile` |
20
+ | VLM Fine-tuning | [Fine-tune Florence-2](https://github.com/merveenoyan/smol-vision/blob/main/Fine_tune_Florence_2.ipynb) | Fine-tune Florence-2 on DocVQA dataset |
21
+ | VLM Fine-tuning | [QLoRA/Fine-tune IDEFICS3 or SmolVLM on VQAv2](https://github.com/merveenoyan/smol-vision/blob/main/Smol_VLM_FT.ipynb) | QLoRA/Full Fine-tune IDEFICS3 or SmolVLM on VQAv2 dataset |
22
+ | VLM Fine-tuning (Script) | [QLoRA Fine-tune IDEFICS3 on VQAv2](https://github.com/merveenoyan/smol-vision/blob/main/smolvlm.py) | QLoRA/Full Fine-tune IDEFICS3 or SmolVLM on VQAv2 dataset |
23
+ | Multimodal RAG | [Multimodal RAG using ColPali and Qwen2-VL](https://github.com/merveenoyan/smol-vision/blob/main/ColPali_%2B_Qwen2_VL.ipynb) | Learn to retrieve documents and build a RAG pipeline without hefty document processing, using ColPali through Byaldi, and do the generation with Qwen2-VL |
24
+ | Multimodal Retriever Fine-tuning | [Fine-tune ColPali for Multimodal RAG](https://github.com/merveenoyan/smol-vision/blob/main/Finetune_ColPali.ipynb) | Learn to apply contrastive fine-tuning on ColPali to customize it for your own multimodal document RAG use case |
25
+ | Speed-up/Memory Optimization | Vision language model serving using TGI (SOON) | Explore speed-ups and memory improvements for vision-language model serving with text-generation inference |
26
+ | Quantization/Optimum/ORT | All levels of quantization and graph optimizations for Image Segmentation using Optimum (SOON) | End-to-end model optimization using Optimum |
Reduce_any_model_to_fp16_using_🤗_Optimum_DETR.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
ShieldGemma_2_for_Vision_LM_Safety.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
Smol_VLM_FT.ipynb ADDED
@@ -0,0 +1,1271 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {
6
+ "id": "nc0g2NLpUSGr"
7
+ },
8
+ "source": [
9
+ "# Fine-tune SmolVLM on Visual Question Answering using a Consumer GPU with QLoRA\n",
10
+ "\n",
11
+ "In this notebook we will fine-tune SmolVLM on the VQAv2 dataset. You can also use this notebook to fine-tune Idefics3, since both models share the same model class/architecture.\n",
12
+ "\n",
13
+ "We will use techniques in this notebook that let you fine-tune the model on an L4 with a batch size of 4, using only around 16.4 GB of VRAM. We tested the notebook in that setup, but since we could afford an A100, this notebook was last run on an A100."
14
+ ]
15
+ },
16
+ {
17
+ "cell_type": "code",
18
+ "execution_count": 1,
19
+ "metadata": {
20
+ "colab": {
21
+ "base_uri": "https://localhost:8080/"
22
+ },
23
+ "id": "WIhA1lQ7j0kw",
24
+ "outputId": "d152531d-8a63-459f-d0b5-f61a47b268d2"
25
+ },
26
+ "outputs": [],
27
+ "source": [
28
+ "!pip install -q accelerate datasets peft bitsandbytes tensorboard"
29
+ ]
30
+ },
31
+ {
32
+ "cell_type": "code",
33
+ "execution_count": 2,
34
+ "metadata": {
35
+ "colab": {
36
+ "base_uri": "https://localhost:8080/"
37
+ },
38
+ "id": "XyJaqZZ3uYYl",
39
+ "outputId": "eff31ad7-7a77-4391-a1ed-6a871e667be5"
40
+ },
41
+ "outputs": [],
42
+ "source": [
43
+ "!pip install -q flash-attn --no-build-isolation"
44
+ ]
45
+ },
46
+ {
47
+ "cell_type": "markdown",
48
+ "metadata": {
49
+ "id": "wAeMA0heVBjT"
50
+ },
51
+ "source": [
52
+ "We will push our model to the Hub, so we need to authenticate ourselves."
53
+ ]
54
+ },
55
+ {
56
+ "cell_type": "code",
57
+ "execution_count": null,
58
+ "metadata": {
59
+ "colab": {
60
+ "base_uri": "https://localhost:8080/",
61
+ "height": 17,
62
+ "referenced_widgets": [
63
+ "261a3abc28d74e4ca5af6f9df8cea3e5",
64
+ "b6284cfacfd642278a7809a154463d69",
65
+ "62c12672f59349b9ade248bee799fa5a",
66
+ "9af532f878ab491096358d3bc83250d8",
67
+ "599303d9f1204c85bca500c859dd0d87",
68
+ "00617a46b15d45648c4796a91c96ec57",
69
+ "5492da586f594365afc30ee6da1bf67c",
70
+ "86aa1abb905346bf8956754a9704f250",
71
+ "eeb2fbfd6cd54c4aa3983dc334a5377d",
72
+ "ed34441fca164b389dfea1eabdba6e4a",
73
+ "99f5b0432c1849128fa181b88925c77b",
74
+ "5e529d6d6c4e40b4863961ea63bf259a",
75
+ "ebfcd83e42ec46afb772d53ad7f35d43",
76
+ "94958be916d6439d87dcd45c59178bec",
77
+ "31a0c4a7fcff4744be56adf4125ef4e6",
78
+ "2c975a8158bf49b389d47a5c4e40c97b",
79
+ "b474bf8f464d40d8865665e4c7f0a411",
80
+ "f8a75ac273fc408f923bf9d7f7263db8",
81
+ "dd08ce6386184df38f47348e547738d8",
82
+ "3aef5e8d5d9e4bd29bd3790ad139c02c"
83
+ ]
84
+ },
85
+ "id": "yKd5xtSGj7cm",
86
+ "outputId": "63b352c0-3f7d-4945-add2-52102246d7b2"
87
+ },
88
+ "outputs": [],
89
+ "source": [
90
+ "from huggingface_hub import notebook_login\n",
91
+ "\n",
92
+ "notebook_login()"
93
+ ]
94
+ },
95
+ {
96
+ "cell_type": "markdown",
97
+ "metadata": {
98
+ "id": "WRq8ve-LVAzU"
99
+ },
100
+ "source": [
101
+ "In this notebook we will not do full fine-tuning, but use the QLoRA method, which trains an adapter on top of a quantized version of the model, saving memory. If you want to do full fine-tuning, set `USE_LORA` and `USE_QLORA` to False. If you want to do LoRA, set `USE_QLORA` to False and `USE_LORA` to True."
102
+ ]
103
+ },
104
+ {
105
+ "cell_type": "code",
106
+ "execution_count": 1,
107
+ "metadata": {},
108
+ "outputs": [],
109
+ "source": [
110
+ "import os\n",
111
+ "os.environ[\"CUDA_DEVICE_ORDER\"] = \"PCI_BUS_ID\"\n",
112
+ "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"1, 2\""
113
+ ]
114
+ },
115
+ {
116
+ "cell_type": "code",
117
+ "execution_count": 2,
118
+ "metadata": {
119
+ "colab": {
120
+ "base_uri": "https://localhost:8080/"
121
+ },
122
+ "id": "b9CDMq0duYYn",
123
+ "outputId": "65a4a5fa-fe4d-4243-b2d7-405a8aa81c04"
124
+ },
125
+ "outputs": [
126
+ {
127
+ "data": {
128
+ "application/vnd.jupyter.widget-view+json": {
129
+ "model_id": "23d3d175e6e642c7abc2bce09b73cf4d",
130
+ "version_major": 2,
131
+ "version_minor": 0
132
+ },
133
+ "text/plain": [
134
+ "processor_config.json: 0%| | 0.00/68.0 [00:00<?, ?B/s]"
135
+ ]
136
+ },
137
+ "metadata": {},
138
+ "output_type": "display_data"
139
+ },
140
+ {
141
+ "data": {
142
+ "application/vnd.jupyter.widget-view+json": {
143
+ "model_id": "db6ca8f47f274464b135909c907c946a",
144
+ "version_major": 2,
145
+ "version_minor": 0
146
+ },
147
+ "text/plain": [
148
+ "chat_template.json: 0%| | 0.00/434 [00:00<?, ?B/s]"
149
+ ]
150
+ },
151
+ "metadata": {},
152
+ "output_type": "display_data"
153
+ },
154
+ {
155
+ "data": {
156
+ "application/vnd.jupyter.widget-view+json": {
157
+ "model_id": "d05822c6293c424fbf9df6ec0f6b532b",
158
+ "version_major": 2,
159
+ "version_minor": 0
160
+ },
161
+ "text/plain": [
162
+ "preprocessor_config.json: 0%| | 0.00/486 [00:00<?, ?B/s]"
163
+ ]
164
+ },
165
+ "metadata": {},
166
+ "output_type": "display_data"
167
+ },
168
+ {
169
+ "data": {
170
+ "application/vnd.jupyter.widget-view+json": {
171
+ "model_id": "05582fca18f443d6965776a721e30e9f",
172
+ "version_major": 2,
173
+ "version_minor": 0
174
+ },
175
+ "text/plain": [
176
+ "tokenizer_config.json: 0%| | 0.00/4.04k [00:00<?, ?B/s]"
177
+ ]
178
+ },
179
+ "metadata": {},
180
+ "output_type": "display_data"
181
+ },
182
+ {
183
+ "data": {
184
+ "application/vnd.jupyter.widget-view+json": {
185
+ "model_id": "3d8974fd1ba9415c8070c1eab8ad75cb",
186
+ "version_major": 2,
187
+ "version_minor": 0
188
+ },
189
+ "text/plain": [
190
+ "vocab.json: 0%| | 0.00/801k [00:00<?, ?B/s]"
191
+ ]
192
+ },
193
+ "metadata": {},
194
+ "output_type": "display_data"
195
+ },
196
+ {
197
+ "data": {
198
+ "application/vnd.jupyter.widget-view+json": {
199
+ "model_id": "648257c1b1c24e25a26355bddf75aa41",
200
+ "version_major": 2,
201
+ "version_minor": 0
202
+ },
203
+ "text/plain": [
204
+ "merges.txt: 0%| | 0.00/466k [00:00<?, ?B/s]"
205
+ ]
206
+ },
207
+ "metadata": {},
208
+ "output_type": "display_data"
209
+ },
210
+ {
211
+ "data": {
212
+ "application/vnd.jupyter.widget-view+json": {
213
+ "model_id": "afa9a31c6b7f45e082ae07dea4a2600e",
214
+ "version_major": 2,
215
+ "version_minor": 0
216
+ },
217
+ "text/plain": [
218
+ "tokenizer.json: 0%| | 0.00/3.52M [00:00<?, ?B/s]"
219
+ ]
220
+ },
221
+ "metadata": {},
222
+ "output_type": "display_data"
223
+ },
224
+ {
225
+ "data": {
226
+ "application/vnd.jupyter.widget-view+json": {
227
+ "model_id": "92232af543a4446cac53e4fcf3f4b6e1",
228
+ "version_major": 2,
229
+ "version_minor": 0
230
+ },
231
+ "text/plain": [
232
+ "added_tokens.json: 0%| | 0.00/92.0 [00:00<?, ?B/s]"
233
+ ]
234
+ },
235
+ "metadata": {},
236
+ "output_type": "display_data"
237
+ },
238
+ {
239
+ "data": {
240
+ "application/vnd.jupyter.widget-view+json": {
241
+ "model_id": "a5f06e59634f4edf9f3d9409846a2b31",
242
+ "version_major": 2,
243
+ "version_minor": 0
244
+ },
245
+ "text/plain": [
246
+ "special_tokens_map.json: 0%| | 0.00/1.07k [00:00<?, ?B/s]"
247
+ ]
248
+ },
249
+ "metadata": {},
250
+ "output_type": "display_data"
251
+ },
252
+ {
253
+ "name": "stderr",
254
+ "output_type": "stream",
255
+ "text": [
256
+ "Some kwargs in processor config are unused and will not have any effect: image_seq_len. \n"
257
+ ]
258
+ },
259
+ {
260
+ "data": {
261
+ "application/vnd.jupyter.widget-view+json": {
262
+ "model_id": "7ddfa8718bc24882ba2b50a899656107",
263
+ "version_major": 2,
264
+ "version_minor": 0
265
+ },
266
+ "text/plain": [
267
+ "config.json: 0%| | 0.00/7.08k [00:00<?, ?B/s]"
268
+ ]
269
+ },
270
+ "metadata": {},
271
+ "output_type": "display_data"
272
+ },
273
+ {
274
+ "data": {
275
+ "application/vnd.jupyter.widget-view+json": {
276
+ "model_id": "5983728a1c1e43edb4d16bee6ad40171",
277
+ "version_major": 2,
278
+ "version_minor": 0
279
+ },
280
+ "text/plain": [
281
+ "model.safetensors: 0%| | 0.00/4.49G [00:00<?, ?B/s]"
282
+ ]
283
+ },
284
+ "metadata": {},
285
+ "output_type": "display_data"
286
+ },
287
+ {
288
+ "data": {
289
+ "application/vnd.jupyter.widget-view+json": {
290
+ "model_id": "dff574197f1f4466abb0eb46d36b8378",
291
+ "version_major": 2,
292
+ "version_minor": 0
293
+ },
294
+ "text/plain": [
295
+ "generation_config.json: 0%| | 0.00/132 [00:00<?, ?B/s]"
296
+ ]
297
+ },
298
+ "metadata": {},
299
+ "output_type": "display_data"
300
+ },
301
+ {
302
+ "name": "stdout",
303
+ "output_type": "stream",
304
+ "text": [
305
+ "(10536960, 2256809840)\n"
306
+ ]
307
+ }
308
+ ],
309
+ "source": [
310
+ "import torch\n",
311
+ "from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model\n",
312
+ "from transformers import AutoProcessor, BitsAndBytesConfig, Idefics3ForConditionalGeneration\n",
313
+ "\n",
314
+ "USE_LORA = False\n",
315
+ "USE_QLORA = True\n",
316
+ "SMOL = True\n",
317
+ "\n",
318
+ "model_id = \"HuggingFaceTB/SmolVLM-Base\" if SMOL else \"HuggingFaceM4/Idefics3-8B-Llama3\"\n",
319
+ "\n",
320
+ "processor = AutoProcessor.from_pretrained(\n",
321
+ " model_id\n",
322
+ ")\n",
323
+ "\n",
324
+ "if USE_QLORA or USE_LORA:\n",
325
+ " lora_config = LoraConfig(\n",
326
+ " r=8,\n",
327
+ " lora_alpha=8,\n",
328
+ " lora_dropout=0.1,\n",
329
+ " target_modules=['down_proj','o_proj','k_proj','q_proj','gate_proj','up_proj','v_proj'],\n",
330
+ " use_dora=False if USE_QLORA else True,\n",
331
+ " init_lora_weights=\"gaussian\"\n",
332
+ " )\n",
333
+ " lora_config.inference_mode = False\n",
334
+ " if USE_QLORA:\n",
335
+ " bnb_config = BitsAndBytesConfig(\n",
336
+ " load_in_4bit=True,\n",
337
+ " bnb_4bit_use_double_quant=True,\n",
338
+ " bnb_4bit_quant_type=\"nf4\",\n",
339
+ " bnb_4bit_compute_dtype=torch.bfloat16\n",
340
+ " )\n",
341
+ "\n",
342
+ " model = Idefics3ForConditionalGeneration.from_pretrained(\n",
343
+ " model_id,\n",
344
+ " quantization_config=bnb_config if USE_QLORA else None,\n",
345
+ " _attn_implementation=\"flash_attention_2\",\n",
346
+ " device_map=\"auto\"\n",
347
+ " )\n",
348
+ " model.add_adapter(lora_config)\n",
349
+ " model.enable_adapters()\n",
350
+ " model = prepare_model_for_kbit_training(model)\n",
351
+ " model = get_peft_model(model, lora_config)\n",
352
+ " print(model.get_nb_trainable_parameters())\n",
353
+ "else:\n",
354
+ " model = Idefics3ForConditionalGeneration.from_pretrained(\n",
355
+ " model_id,\n",
356
+ " torch_dtype=torch.bfloat16,\n",
357
+ " _attn_implementation=\"flash_attention_2\",\n",
358
+ " ).to(\"cuda\") # DEVICE was undefined in this branch; default to CUDA\n",
359
+ "\n",
360
+ " # if you'd like to only fine-tune LLM\n",
361
+ " for param in model.model.vision_model.parameters():\n",
362
+ " param.requires_grad = False"
363
+ ]
364
+ },
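The `print(model.get_nb_trainable_parameters())` call above reports `(10536960, 2256809840)`: trainable versus total parameters. A quick sanity check (pure Python arithmetic on the numbers from that output, nothing model-specific) shows the LoRA adapter trains well under 1% of the model:

```python
# Fraction of parameters the LoRA adapter actually trains.
# Numbers taken from the notebook output above; this is just arithmetic.
trainable, total = 10_536_960, 2_256_809_840

pct = 100 * trainable / total
print(f"Trainable: {pct:.2f}% of all parameters")  # roughly 0.47%
```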
365
+ {
366
+ "cell_type": "markdown",
367
+ "metadata": {
368
+ "id": "WIVhpp0EyZO2"
369
+ },
370
+ "source": [
371
+ "As-is, the model takes up 2.7 GB of GPU RAM 💗"
372
+ ]
373
+ },
374
+ {
375
+ "cell_type": "markdown",
376
+ "metadata": {
377
+ "id": "LMTtg3dl3NX2"
378
+ },
379
+ "source": [
380
+ "## Loading the dataset and Preprocessing"
381
+ ]
382
+ },
383
+ {
384
+ "cell_type": "markdown",
385
+ "metadata": {
386
+ "id": "pWHMWTSZ3Pyr"
387
+ },
388
+ "source": [
389
+ "We will load a small portion of the VQAv2 dataset for educational purposes."
390
+ ]
391
+ },
392
+ {
393
+ "cell_type": "code",
394
+ "execution_count": 3,
395
+ "metadata": {
396
+ "id": "POOqKqYRka5O"
397
+ },
398
+ "outputs": [],
399
+ "source": [
400
+ "from datasets import load_dataset\n",
401
+ "ds = load_dataset('merve/vqav2-small', trust_remote_code=True)"
402
+ ]
403
+ },
404
+ {
405
+ "cell_type": "code",
406
+ "execution_count": 6,
407
+ "metadata": {
408
+ "id": "Znf9vMo5rnSd"
409
+ },
410
+ "outputs": [],
411
+ "source": [
412
+ "split_ds = ds[\"validation\"].train_test_split(test_size=0.5)\n",
413
+ "train_ds = split_ds[\"train\"]"
414
+ ]
415
+ },
416
+ {
417
+ "cell_type": "code",
418
+ "execution_count": 7,
419
+ "metadata": {
420
+ "colab": {
421
+ "base_uri": "https://localhost:8080/"
422
+ },
423
+ "id": "FIDioFlRuYYn",
424
+ "outputId": "79b697a7-d245-4fdc-b0e8-d9ffa8627953"
425
+ },
426
+ "outputs": [
427
+ {
428
+ "data": {
429
+ "text/plain": [
430
+ "Dataset({\n",
431
+ " features: ['multiple_choice_answer', 'question', 'image'],\n",
432
+ " num_rows: 10717\n",
433
+ "})"
434
+ ]
435
+ },
436
+ "execution_count": 7,
437
+ "metadata": {},
438
+ "output_type": "execute_result"
439
+ }
440
+ ],
441
+ "source": [
442
+ "train_ds"
443
+ ]
444
+ },
445
+ {
446
+ "cell_type": "markdown",
447
+ "metadata": {
448
+ "id": "5nwMO3n0X7Hv"
449
+ },
450
+ "source": [
451
+ "Let's write our data collating function. We will apply the chat template so that questions and answers appear together, letting the model learn to answer. Then we pass the formatted prompts and images to the processor, which processes both."
452
+ ]
453
+ },
454
+ {
455
+ "cell_type": "code",
456
+ "execution_count": 8,
457
+ "metadata": {
458
+ "id": "e0krVLZ-wNMl"
459
+ },
460
+ "outputs": [],
461
+ "source": [
462
+ "image_token_id = processor.tokenizer.additional_special_tokens_ids[\n",
463
+ " processor.tokenizer.additional_special_tokens.index(\"<image>\")]\n",
464
+ "\n",
465
+ "def collate_fn(examples):\n",
466
+ " texts = []\n",
467
+ " images = []\n",
468
+ " for example in examples:\n",
469
+ " image = example[\"image\"]\n",
470
+ " if image.mode != 'RGB':\n",
471
+ " image = image.convert('RGB')\n",
472
+ " question = example[\"question\"]\n",
473
+ " answer = example[\"multiple_choice_answer\"]\n",
474
+ " messages = [\n",
475
+ " {\n",
476
+ " \"role\": \"user\",\n",
477
+ " \"content\": [\n",
478
+ " {\"type\": \"text\", \"text\": \"Answer briefly.\"},\n",
479
+ " {\"type\": \"image\"},\n",
480
+ " {\"type\": \"text\", \"text\": question}\n",
481
+ " ]\n",
482
+ " },\n",
483
+ " {\n",
484
+ " \"role\": \"assistant\",\n",
485
+ " \"content\": [\n",
486
+ " {\"type\": \"text\", \"text\": answer}\n",
487
+ " ]\n",
488
+ " }\n",
489
+ " ]\n",
490
+ " text = processor.apply_chat_template(messages, add_generation_prompt=False)\n",
491
+ " texts.append(text.strip())\n",
492
+ " images.append([image])\n",
493
+ "\n",
494
+ " batch = processor(text=texts, images=images, return_tensors=\"pt\", padding=True)\n",
495
+ " labels = batch[\"input_ids\"].clone()\n",
496
+ " labels[labels == processor.tokenizer.pad_token_id] = -100\n",
497
+ " labels[labels == image_token_id] = -100\n",
498
+ " batch[\"labels\"] = labels\n",
499
+ "\n",
500
+ " return batch"
501
+ ]
502
+ },
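The label masking in `collate_fn` above (setting pad and image token positions to `-100`) keeps the loss from being computed on tokens the model should not learn to predict. A minimal, framework-free sketch of the same idea (the token ids and constants here are made up for illustration, not the real vocabulary):

```python
# Minimal sketch of the label-masking step in collate_fn above.
# Token ids are illustrative; -100 is the ignore index that the
# cross-entropy loss in transformers skips.
PAD_ID, IMAGE_ID = 0, 7

def mask_labels(input_ids):
    return [-100 if t in (PAD_ID, IMAGE_ID) else t for t in input_ids]

# Pad and image tokens are ignored; text tokens are kept as targets.
print(mask_labels([7, 7, 12, 45, 3, 0, 0]))  # [-100, -100, 12, 45, 3, -100, -100]
```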
503
+ {
504
+ "cell_type": "markdown",
505
+ "metadata": {
506
+ "id": "kEYDjWpE3LD5"
507
+ },
508
+ "source": [
509
+ "## Training"
510
+ ]
511
+ },
512
+ {
513
+ "cell_type": "markdown",
514
+ "metadata": {
515
+ "id": "QvAs896cdwg8"
516
+ },
517
+ "source": [
518
+ "We can now initialize `TrainingArguments` and pass them to the `Trainer`.\n",
519
+ "\n",
520
+ "Some notes:\n",
521
+ "- If you use 8-bit QLoRA with the setup below, it uses around 16.4 GB of VRAM (beautiful, fits comfortably on an L4, Colab free tier)\n",
522
+ "- We use gradient accumulation to simulate a larger batch size.\n",
523
+ "- We also save up on memory from intermediate activations by using gradient checkpointing.\n",
524
+ "\n",
525
+ "**Disclaimer:** \n",
526
+ "These techniques aren't a free lunch. The latter two add extra compute to the training and thus slow it down a bit (for reference, on two A100s with a batch size of 16, training took 2 hrs 43 mins with gradient accumulation steps of 4; disabling it reduced that to 2 hrs 35 mins). \n",
527
+ "If you want to speed things up, you can experiment: reduce to 4-bit precision and use a higher batch size. Note that 4-bit quantization might make the model learn less well."
528
+ ]
529
+ },
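Gradient accumulation makes the optimizer see an effective batch equal to the per-device batch size times the accumulation steps times the number of GPUs. A small sketch of that arithmetic (the two-GPU count below comes from the disclaimer above; treat it as illustrative):

```python
# Effective batch size seen by the optimizer when gradient accumulation is on.
def effective_batch_size(per_device, grad_accum_steps, num_gpus=1):
    return per_device * grad_accum_steps * num_gpus

# per_device_train_batch_size=16 and gradient_accumulation_steps=4,
# on two A100s as mentioned in the disclaimer above.
print(effective_batch_size(16, 4, num_gpus=2))  # 128
```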
530
+ {
531
+ "cell_type": "code",
532
+ "execution_count": 15,
533
+ "metadata": {
534
+ "id": "QNE2yWAYrAhD"
535
+ },
536
+ "outputs": [],
537
+ "source": [
538
+ "from transformers import TrainingArguments, Trainer\n",
539
+ "\n",
540
+ "model_name = model_id.split(\"/\")[-1]\n",
541
+ "\n",
542
+ "training_args = TrainingArguments(\n",
543
+ " num_train_epochs=1,\n",
544
+ " per_device_train_batch_size=16,\n",
545
+ " gradient_accumulation_steps=4,\n",
546
+ " warmup_steps=50,\n",
547
+ " learning_rate=1e-4,\n",
548
+ " weight_decay=0.01,\n",
549
+ " logging_steps=25,\n",
550
+ " save_strategy=\"steps\",\n",
551
+ " save_steps=250,\n",
552
+ " save_total_limit=1,\n",
553
+ " optim=\"paged_adamw_8bit\", # for 8-bit, keep this, else adamw_hf\n",
554
+ " bf16=True, # underlying precision for 8bit\n",
555
+ " output_dir=f\"./{model_name}-vqav2\",\n",
556
+ " hub_model_id=f\"{model_name}-vqav2\",\n",
557
+ " report_to=\"tensorboard\",\n",
558
+ " remove_unused_columns=False,\n",
559
+ " gradient_checkpointing=True\n",
560
+ ")\n"
561
+ ]
562
+ },
563
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {
+ "id": "oBBSDpBhreJd"
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.\n"
+ ]
+ }
+ ],
+ "source": [
+ "trainer = Trainer(\n",
+ "    model=model,\n",
+ "    args=training_args,\n",
+ "    data_collator=collate_fn,\n",
+ "    train_dataset=train_ds,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "_QOCpw_-uYYo"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ " <div>\n",
+ " \n",
+ " <progress value='9' max='670' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
+ " [ 9/670 01:41 < 2:39:41, 0.07 it/s, Epoch 0.01/1]\n",
+ " </div>\n",
+ " <table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: left;\">\n",
+ " <th>Step</th>\n",
+ " <th>Training Loss</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " </tbody>\n",
+ "</table><p>"
+ ],
+ "text/plain": [
+ "<IPython.core.display.HTML object>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "trainer.train()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "0hN0QD9_uYYo"
+ },
+ "outputs": [],
+ "source": [
+ "trainer.push_to_hub()"
+ ]
+ }
+ ],
638
+ "metadata": {
+ "accelerator": "GPU",
+ "colab": {
+ "gpuType": "A100",
+ "provenance": []
+ },
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.12.4"
660
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+ }
inference_gists/Aria_Inference.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
inference_gists/ColQwen2.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
inference_gists/IBM_Granite_Vision.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
inference_gists/InternVL3_Gist.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
knowledge_distillation.md ADDED
@@ -0,0 +1,186 @@
+ <!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+ Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations under the License.
+
+ ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+ rendered properly in your Markdown viewer.
+
+ -->
+ # Knowledge Distillation for Computer Vision
+
+ [[open-in-colab]]
+
+ Knowledge distillation is a technique used to transfer knowledge from a larger, more complex model (the teacher) to a smaller, simpler model (the student). To distill knowledge from one model to another, we take a pre-trained teacher model trained on a certain task (image classification in this case) and randomly initialize a student model to be trained on the same task. Next, we train the student model to minimize the difference between its outputs and the teacher's outputs, thus making it mimic the teacher's behavior. The technique was first introduced in [Distilling the Knowledge in a Neural Network by Hinton et al.](https://arxiv.org/abs/1503.02531). In this guide, we will do task-specific knowledge distillation, using the [beans dataset](https://huggingface.co/datasets/beans).
+
+ This guide demonstrates how you can distill a [fine-tuned ViT model](https://huggingface.co/merve/vit-mobilenet-beans-224) (the teacher) to a [MobileNet](https://huggingface.co/google/mobilenet_v2_1.4_224) (the student) using the [Trainer API](https://huggingface.co/docs/transformers/en/main_classes/trainer#trainer) of 🤗 Transformers.
+
+ Let's install the libraries needed for distillation and for evaluating the process.
+
+ ```bash
+ pip install transformers datasets accelerate tensorboard evaluate --upgrade
+ ```
+
+ In this example, we use the `merve/beans-vit-224` model as the teacher. It's an image classification model based on `google/vit-base-patch16-224-in21k`, fine-tuned on the beans dataset. We will distill this model into a randomly initialized MobileNetV2.
+
+ We will now load the dataset.
+
+ ```python
+ from datasets import load_dataset
+
+ dataset = load_dataset("beans")
+ ```
+
+ We can use an image processor from either of the models, as in this case they return the same output at the same resolution. We will use the `map()` method of `dataset` to apply the preprocessing to every split of the dataset.
+
+ ```python
+ from transformers import AutoImageProcessor
+
+ teacher_processor = AutoImageProcessor.from_pretrained("merve/beans-vit-224")
+
+ def process(examples):
+     processed_inputs = teacher_processor(examples["image"])
+     return processed_inputs
+
+ processed_datasets = dataset.map(process, batched=True)
+ ```
+
+ Essentially, we want the student model (a randomly initialized MobileNet) to mimic the teacher model (a fine-tuned Vision Transformer). To achieve this, we first get the logits from both the teacher and the student. Then, we divide each of them by the parameter `temperature`, which controls the importance of each soft target. A parameter called `lambda` weighs the importance of the distillation loss. In this example, we will use `temperature=5` and `lambda=0.5`. We will use the Kullback-Leibler divergence loss to compute the divergence between the student and the teacher. Given two distributions P and Q, KL divergence measures how much extra information is needed to represent P using Q. If the two are identical, their KL divergence is zero, as no extra information is needed to explain P from Q. This makes KL divergence a natural choice for knowledge distillation.
+
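To make the temperature's effect concrete, here is a small sketch with made-up logits for three classes (the numbers are illustrative only). Note that `nn.KLDivLoss` expects log-probabilities as input and probabilities as target:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up logits for a single image over three classes
teacher_logits = torch.tensor([[3.0, 1.0, 0.2]])
student_logits = torch.tensor([[2.5, 0.5, 1.0]])
temperature = 5.0

# Dividing logits by the temperature flattens both distributions,
# exposing the teacher's "dark knowledge" in the small probabilities
soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
soft_student = F.log_softmax(student_logits / temperature, dim=-1)

# Positive: the two distributions differ
kl_loss = nn.KLDivLoss(reduction="batchmean")
kl = kl_loss(soft_student, soft_teacher)

# KL divergence of a distribution with itself is zero
zero_kl = kl_loss(F.log_softmax(teacher_logits / temperature, dim=-1), soft_teacher)

print(kl.item(), zero_kl.item())
```

Raising the temperature pushes both distributions toward uniform; at `temperature=1` you recover the ordinary softmax.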
+ ```python
+ from transformers import TrainingArguments, Trainer
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+
+ class ImageDistilTrainer(Trainer):
+     def __init__(self, teacher_model=None, student_model=None, temperature=None, lambda_param=None, *args, **kwargs):
+         super().__init__(model=student_model, *args, **kwargs)
+         self.teacher = teacher_model
+         self.student = student_model
+         self.loss_function = nn.KLDivLoss(reduction="batchmean")
+         device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+         self.teacher.to(device)
+         self.teacher.eval()
+         self.temperature = temperature
+         self.lambda_param = lambda_param
+
+     def compute_loss(self, student, inputs, return_outputs=False):
+         student_output = self.student(**inputs)
+
+         with torch.no_grad():
+             teacher_output = self.teacher(**inputs)
+
+         # Compute soft targets for teacher and student
+         soft_teacher = F.softmax(teacher_output.logits / self.temperature, dim=-1)
+         soft_student = F.log_softmax(student_output.logits / self.temperature, dim=-1)
+
+         # Compute the distillation loss, scaled by temperature**2
+         distillation_loss = self.loss_function(soft_student, soft_teacher) * (self.temperature ** 2)
+
+         # Compute the true label loss
+         student_target_loss = student_output.loss
+
+         # Calculate the final loss
+         loss = (1. - self.lambda_param) * student_target_loss + self.lambda_param * distillation_loss
+         return (loss, student_output) if return_outputs else loss
+ ```
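As an illustration of how `compute_loss` combines the two terms, here is a standalone sketch with made-up logits and labels. The names mirror the trainer above, but in the real trainer `student_target_loss` comes from the model's own output rather than an explicit cross-entropy call:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

temperature, lambda_param = 5.0, 0.5

# Made-up logits for a batch of two images over three classes
student_logits = torch.tensor([[2.0, 0.5, 0.1], [0.3, 1.7, 0.2]])
teacher_logits = torch.tensor([[2.5, 0.3, 0.2], [0.1, 2.2, 0.4]])
labels = torch.tensor([0, 1])

# Soft targets; the temperature**2 factor keeps the gradient magnitude
# of the softened loss comparable to the hard-label loss
soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
soft_student = F.log_softmax(student_logits / temperature, dim=-1)
distillation_loss = nn.KLDivLoss(reduction="batchmean")(soft_student, soft_teacher) * temperature**2

# Hard-label loss against the ground-truth classes
student_target_loss = F.cross_entropy(student_logits, labels)

# lambda_param interpolates between the two objectives
loss = (1.0 - lambda_param) * student_target_loss + lambda_param * distillation_loss
print(loss.item())
```

With `lambda_param=0`, training ignores the teacher entirely; with `lambda_param=1`, it only matches the teacher's soft targets.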
+
+ We will now log in to the Hugging Face Hub so we can push our model there through the `Trainer`.
+
+ ```python
+ from huggingface_hub import notebook_login
+
+ notebook_login()
+ ```
+
+ Let's set up the `TrainingArguments`, the teacher model and the student model.
+
+ ```python
+ from transformers import AutoModelForImageClassification, MobileNetV2Config, MobileNetV2ForImageClassification
+
+ repo_name = "my-awesome-model"  # used for the output directory and the Hub repository
+
+ training_args = TrainingArguments(
+     output_dir=repo_name,
+     num_train_epochs=30,
+     fp16=True,
+     logging_dir=f"{repo_name}/logs",
+     logging_strategy="epoch",
+     eval_strategy="epoch",
+     save_strategy="epoch",
+     load_best_model_at_end=True,
+     metric_for_best_model="accuracy",
+     report_to="tensorboard",
+     push_to_hub=True,
+     hub_strategy="every_save",
+     hub_model_id=repo_name,
+ )
+
+ num_labels = len(processed_datasets["train"].features["labels"].names)
+
+ # initialize the teacher model from the fine-tuned checkpoint
+ teacher_model = AutoModelForImageClassification.from_pretrained(
+     "merve/beans-vit-224",
+     num_labels=num_labels,
+     ignore_mismatched_sizes=True
+ )
+
+ # train MobileNetV2 from scratch as the student
+ student_config = MobileNetV2Config()
+ student_config.num_labels = num_labels
+ student_model = MobileNetV2ForImageClassification(student_config)
+ ```
+
+ We can use the `compute_metrics` function to evaluate our model on the test set. This function will be used during the training process to compute the accuracy of our model.
+
+ ```python
+ import evaluate
+ import numpy as np
+
+ accuracy = evaluate.load("accuracy")
+
+ def compute_metrics(eval_pred):
+     predictions, labels = eval_pred
+     acc = accuracy.compute(references=labels, predictions=np.argmax(predictions, axis=1))
+     return {"accuracy": acc["accuracy"]}
+ ```
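For intuition, here is what `compute_metrics` effectively computes, worked by hand on hypothetical logits for three evaluation examples (so it doesn't depend on downloading the `evaluate` metric):

```python
import numpy as np

# Hypothetical (logits, labels) pair as passed to compute_metrics
predictions = np.array([[2.0, 0.1, 0.3],
                        [0.2, 1.5, 0.1],
                        [1.3, 0.2, 0.9]])
labels = np.array([0, 1, 2])

# argmax over the class axis turns logits into predicted class ids
preds = np.argmax(predictions, axis=1)

# fraction of predictions that match the labels (two of three here)
acc = float((preds == labels).mean())
print(acc)
```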
+
+ Let's initialize the `Trainer` with the training arguments we defined, along with our data collator.
+
+ ```python
+ from transformers import DefaultDataCollator
+
+ data_collator = DefaultDataCollator()
+ trainer = ImageDistilTrainer(
+     student_model=student_model,
+     teacher_model=teacher_model,
+     args=training_args,
+     train_dataset=processed_datasets["train"],
+     eval_dataset=processed_datasets["validation"],
+     data_collator=data_collator,
+     tokenizer=teacher_processor,
+     compute_metrics=compute_metrics,
+     temperature=5,
+     lambda_param=0.5
+ )
+ ```
+
+ We can now train our model.
+
+ ```python
+ trainer.train()
+ ```
+
+ We can evaluate the model on the test set.
+
+ ```python
+ trainer.evaluate(processed_datasets["test"])
+ ```
+
+ On the test set, our model reaches 72 percent accuracy. As a sanity check on the efficiency of distillation, we also trained MobileNet from scratch on the beans dataset with the same hyperparameters and observed 63 percent accuracy on the test set. We invite readers to try different pre-trained teacher models, student architectures and distillation parameters, and to report their findings. The training logs and checkpoints for the distilled model can be found in [this repository](https://huggingface.co/merve/vit-mobilenet-beans-224), and the MobileNetV2 trained from scratch in [this repository](https://huggingface.co/merve/resnet-mobilenet-beans-5).
paligemma.py ADDED
@@ -0,0 +1,91 @@
+ from datasets import load_dataset
+ import torch
+ from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration, Trainer, TrainingArguments, BitsAndBytesConfig
+ from peft import get_peft_model, LoraConfig
+
+ USE_LORA = False
+ USE_QLORA = False
+ FREEZE_VISION = False
+
+ ds = load_dataset('merve/vqav2-small', split="validation")
+ ds = ds.train_test_split(test_size=0.5)["train"]
+
+ model_id = "google/paligemma2-3b-pt-448"
+ processor = PaliGemmaProcessor.from_pretrained(model_id)
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+
+ image_token = processor.tokenizer.convert_tokens_to_ids("<image>")
+
+ def collate_fn(examples):
+     texts = ["<image>answer en " + example["question"] for example in examples]
+     labels = [example['multiple_choice_answer'] for example in examples]
+     images = [example["image"].convert("RGB") for example in examples]
+     tokens = processor(text=texts, images=images, suffix=labels,
+                        return_tensors="pt", padding="longest")
+     tokens = tokens.to(torch.bfloat16).to(device)
+     return tokens
+
+
+ if USE_LORA or USE_QLORA:
+     lora_config = LoraConfig(
+         r=8,
+         target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
+         task_type="CAUSAL_LM",
+     )
+     if USE_QLORA:
+         bnb_config = BitsAndBytesConfig(
+             load_in_4bit=True,
+             bnb_4bit_quant_type="nf4",
+             bnb_4bit_compute_dtype=torch.bfloat16
+         )
+     model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, device_map="auto",
+                                                               quantization_config=bnb_config if USE_QLORA else None,
+                                                               torch_dtype=torch.bfloat16)
+     model = get_peft_model(model, lora_config)
+     model.print_trainable_parameters()
+ else:
+     model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, device_map="auto")
+
+ if FREEZE_VISION:
+     for param in model.vision_tower.parameters():
+         param.requires_grad = False
+     for param in model.multi_modal_projector.parameters():
+         param.requires_grad = False
+
+
+ args = TrainingArguments(
+     num_train_epochs=3,
+     remove_unused_columns=False,
+     per_device_train_batch_size=4,
+     gradient_accumulation_steps=4,
+     warmup_steps=2,
+     learning_rate=2e-5,
+     weight_decay=1e-6,
+     adam_beta2=0.999,
+     logging_steps=100,
+     optim="adamw_hf",
+     save_strategy="steps",
+     save_steps=1000,
+     save_total_limit=1,
+     push_to_hub=True,
+     output_dir="paligemma_vqav2",
+     bf16=True,
+     report_to=["tensorboard"],
+     dataloader_pin_memory=False
+ )
+
+
+ trainer = Trainer(
+     model=model,
+     train_dataset=ds,
+     data_collator=collate_fn,
+     args=args
+ )
+
+ trainer.train()
smolvlm.py ADDED
@@ -0,0 +1,137 @@
+ import os
+ import torch
+ from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
+ from transformers import AutoProcessor, BitsAndBytesConfig, Idefics3ForConditionalGeneration
+ from transformers import TrainingArguments, Trainer
+ from datasets import load_dataset
+
+ USE_LORA = False
+ USE_QLORA = True
+ SMOL = True
+ DEVICE = "cuda"  # used when training without quantization
+
+ model_id = "HuggingFaceTB/SmolVLM-Base" if SMOL else "HuggingFaceM4/Idefics3-8B-Llama3"
+
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
+ os.environ["CUDA_VISIBLE_DEVICES"] = "1, 4"
+
+ if USE_QLORA or USE_LORA:
+     lora_config = LoraConfig(
+         r=8,
+         lora_alpha=8,
+         lora_dropout=0.1,
+         target_modules=['down_proj', 'o_proj', 'k_proj', 'q_proj', 'gate_proj', 'up_proj', 'v_proj'],
+         use_dora=False if USE_QLORA else True,
+         init_lora_weights="gaussian"
+     )
+     lora_config.inference_mode = False
+     if USE_QLORA:
+         bnb_config = BitsAndBytesConfig(
+             load_in_4bit=True,
+             bnb_4bit_use_double_quant=True,
+             bnb_4bit_quant_type="nf4",
+             bnb_4bit_compute_dtype=torch.bfloat16
+         )
+     model = Idefics3ForConditionalGeneration.from_pretrained(
+         model_id,
+         quantization_config=bnb_config if USE_QLORA else None,
+         _attn_implementation="flash_attention_2",
+         device_map="auto"
+     )
+     model.add_adapter(lora_config)
+     model.enable_adapters()
+     model = prepare_model_for_kbit_training(model)
+     model = get_peft_model(model, lora_config)
+     print(model.get_nb_trainable_parameters())
+ else:
+     model = Idefics3ForConditionalGeneration.from_pretrained(
+         model_id,
+         torch_dtype=torch.bfloat16,
+         _attn_implementation="flash_attention_2",
+     ).to(DEVICE)
+
+     # if you'd like to only fine-tune the LLM, freeze the vision encoder
+     for param in model.model.vision_model.parameters():
+         param.requires_grad = False
+
+ ds = load_dataset('merve/vqav2-small', trust_remote_code=True)
+
+ split_ds = ds["validation"].train_test_split(test_size=0.8)
+ train_ds = split_ds["train"]
+
+
+ image_token_id = processor.tokenizer.additional_special_tokens_ids[
+     processor.tokenizer.additional_special_tokens.index("<image>")]
+
+ def collate_fn(examples):
+     texts = []
+     images = []
+     for example in examples:
+         image = example["image"]
+         if image.mode != 'RGB':
+             image = image.convert('RGB')
+         question = example["question"]
+         answer = example["multiple_choice_answer"]
+         messages = [
+             {
+                 "role": "user",
+                 "content": [
+                     {"type": "text", "text": "Answer briefly."},
+                     {"type": "image"},
+                     {"type": "text", "text": question}
+                 ]
+             },
+             {
+                 "role": "assistant",
+                 "content": [
+                     {"type": "text", "text": answer}
+                 ]
+             }
+         ]
+         text = processor.apply_chat_template(messages, add_generation_prompt=False)
+         texts.append(text.strip())
+         images.append([image])
+
+     batch = processor(text=texts, images=images, return_tensors="pt", padding=True)
+     labels = batch["input_ids"].clone()
+     labels[labels == processor.tokenizer.pad_token_id] = -100
+     labels[labels == image_token_id] = -100
+     batch["labels"] = labels
+
+     return batch
+
+
+ model_name = model_id.split("/")[-1]
+
+ training_args = TrainingArguments(
+     num_train_epochs=1,
+     per_device_train_batch_size=8,
+     gradient_accumulation_steps=4,
+     warmup_steps=50,
+     learning_rate=1e-4,
+     weight_decay=0.01,
+     logging_steps=25,
+     save_strategy="steps",
+     save_steps=250,
+     save_total_limit=1,
+     optim="paged_adamw_8bit",  # for 8-bit, keep this, else adamw_hf
+     bf16=True,  # underlying precision for 8-bit
+     output_dir=f"./{model_name}-vqav2",
+     hub_model_id=f"{model_name}-vqav2",
+     report_to="tensorboard",
+     remove_unused_columns=False,
+     gradient_checkpointing=True
+ )
+
+ trainer = Trainer(
+     model=model,
+     args=training_args,
+     data_collator=collate_fn,
+     train_dataset=train_ds,
+ )
+
+ trainer.train()
+ trainer.push_to_hub()
train_idefics2.py ADDED
@@ -0,0 +1,132 @@
+ import os
+ import torch
+ from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
+ from transformers import AutoProcessor, BitsAndBytesConfig, Idefics3ForConditionalGeneration
+ from transformers import TrainingArguments, Trainer
+ from datasets import load_dataset
+
+
+ # Pin the process to a single GPU; these must be set as environment variables
+ # (plain assignments like `PCI_BUS_ID = 4` would have no effect).
+ os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
+ os.environ["CUDA_VISIBLE_DEVICES"] = "4"
+ DEVICE = "cuda"  # with CUDA_VISIBLE_DEVICES set, the selected GPU is visible as cuda:0
+
+ USE_LORA = False
+ USE_QLORA = True
+ model_id = "HuggingFaceM4/Idefics3-8B-Llama3"
+
+ processor = AutoProcessor.from_pretrained(model_id)
+
+
+ if USE_QLORA or USE_LORA:
+     lora_config = LoraConfig(
+         r=8,
+         lora_alpha=8,
+         lora_dropout=0.1,
+         target_modules=['down_proj', 'o_proj', 'k_proj', 'q_proj', 'gate_proj', 'up_proj', 'v_proj'],
+         use_dora=False if USE_QLORA else True,
+         init_lora_weights="gaussian"
+     )
+     lora_config.inference_mode = False
+     if USE_QLORA:
+         bnb_config = BitsAndBytesConfig(
+             load_in_4bit=True,
+             bnb_4bit_use_double_quant=True,
+             bnb_4bit_quant_type="nf4",
+             bnb_4bit_compute_dtype=torch.bfloat16
+         )
+     model = Idefics3ForConditionalGeneration.from_pretrained(
+         model_id,
+         quantization_config=bnb_config if USE_QLORA else None,
+         _attn_implementation="flash_attention_2",
+         device_map="auto"
+     )
+     model.add_adapter(lora_config)
+     model.enable_adapters()
+     model = prepare_model_for_kbit_training(model)
+     model = get_peft_model(model, lora_config)
+     print(model.get_nb_trainable_parameters())
+ else:
+     model = Idefics3ForConditionalGeneration.from_pretrained(
+         model_id,
+         torch_dtype=torch.bfloat16,
+         _attn_implementation="flash_attention_2",
+     ).to(DEVICE)
+
+     # if you'd like to only fine-tune the LLM, freeze the vision encoder
+     for param in model.model.vision_model.parameters():
+         param.requires_grad = False
+
+ ds = load_dataset('merve/vqav2-small', trust_remote_code=True)
+ split_ds = ds["validation"].train_test_split(test_size=0.8)
+ train_ds = split_ds["train"]
+
+ image_token_id = processor.tokenizer.additional_special_tokens_ids[
+     processor.tokenizer.additional_special_tokens.index("<image>")]
+
+ def collate_fn(examples):
+     texts = []
+     images = []
+     for example in examples:
+         image = example["image"]
+         question = example["question"]
+         answer = example["multiple_choice_answer"]
+         messages = [
+             {
+                 "role": "user",
+                 "content": [
+                     {"type": "text", "text": "Answer briefly."},
+                     {"type": "image"},
+                     {"type": "text", "text": question}
+                 ]
+             },
+             {
+                 "role": "assistant",
+                 "content": [
+                     {"type": "text", "text": answer}
+                 ]
+             }
+         ]
+         text = processor.apply_chat_template(messages, add_generation_prompt=False)
+         texts.append(text.strip())
+         images.append([image])
+
+     batch = processor(text=texts, images=images, return_tensors="pt", padding=True)
+     labels = batch["input_ids"].clone()
+     labels[labels == processor.tokenizer.pad_token_id] = -100
+     labels[labels == image_token_id] = -100
+     batch["labels"] = labels
+
+     return batch
+
+
+ training_args = TrainingArguments(
+     num_train_epochs=1,
+     per_device_train_batch_size=1,  # increase for QLoRA
+     gradient_accumulation_steps=8,
+     warmup_steps=50,
+     learning_rate=1e-4,
+     weight_decay=0.01,
+     logging_steps=25,
+     save_strategy="steps",
+     save_steps=250,
+     save_total_limit=1,
+     optim="adamw_hf",  # for 8-bit, pick paged_adamw_hf
+     # evaluation_strategy="epoch",
+     bf16=True,
+     output_dir="./idefics3-llama-vqav2",
+     hub_model_id="idefics3-llama-vqav2",
+     remove_unused_columns=False,
+ )
+
+ trainer = Trainer(
+     model=model,
+     args=training_args,
+     data_collator=collate_fn,
+     train_dataset=train_ds,
+ )
+
+ trainer.train()
+ trainer.push_to_hub()