{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "Q-bj6K7Qv4ft" }, "source": [ "# Fine-Tuning a Generative Pretrained Transformer (`GPT`)\n", "\n", "1. Install required libraries." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "OFWjSb_xDWja", "outputId": "ed4678bd-4add-44c0-8909-086a3a8baffd" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (4.30.2)\n", "Requirement already satisfied: datasets in /usr/local/lib/python3.10/dist-packages (2.13.1)\n", "Requirement already satisfied: codecarbon in /usr/local/lib/python3.10/dist-packages (2.2.4)\n" ] } ], "source": [ "!pip install transformers datasets codecarbon" ] }, { "cell_type": "markdown", "metadata": { "id": "pY6M4fSb8SY6" }, "source": [ "2. Load the data from the hub." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 702, "referenced_widgets": [ "554f5c0563ed41d887ec5afff1746bb5", "582f98dc44ab4dcaa4fa80dca1c28dcd", "935a8fc695744f828ccf83e1f8002d19", "14b4f260194d41fb84b26e619d555ccc", "27308f39e99b4ed0baeddc33f2a8c802", "885c799bb73a4599974937fc7ba0e8a0", "7aaacf86a2704a56bac58af66a876d38", "1ad6d8e680bb4c9ab2b300a7e0b16bb1", "356aad6c172445e98f9be2f1f7d89829", "c90f75aad0f646128199f29eb0387452", "fc99dc7c8bad42a296b8b60c80cfcc0b", "3e903c1866c74b4f9b3fa4d858bacfb6", "aabd5b7f941f4b62903bcf663fc0158f", "40f707c393844c3b95f6709c29d5ff7d", "99a3a58c3243403d9c7cb7110d5e73ad", "a679ffedc6424ca3904c3065245ef875", "f2eac6ad4075464ea3ea58fc95d61a2a", "75d29b968a75441a9f6fe44703247fbc", "cd1b0f1c9bb0482baa90c7923751fdba", "327dad5aa051420b8e15dce974c9b3fa", "0e26bb72fe984a269a4678e475f8d8a0", "b9fc57da314043c7886185f233f9c611", "5230954d31b14113b5fdab5111f78165", "a389e7a4ca344bb4b6617beb496448d2", "a5ca6c8c668c43cea9fadab63db3a368", "0ffcd503ca9943ea8703d37a8d20e40f", "34d378b1425e499abd693e273f2b5cf2", "1f31a5f5727d4f24a56ec0555d6047c5", "543a4d5e593b439999e3f4699489423e", "c655a4e5e48a45049153a8029cda4061", "b71f81ce9e3a458281231f3da61e4d17", "1814cc2f85f0467393453c9a6ab49a8a", "274f7469438547379d073a857a32d46d", "677ff3857ae04554b354a50ebd30f6b2", "2fba253ab416404ab9fc281196549ba2", "032687de171a4ec9989c2ecbf64d2da6", "9facfae8db87451698e08c5d94d0d53a", "0a53b0a82f06407091b9aafd146aec57", "4b291489ebc446b88be8681dd6a356e5", "8116ae465dce400c8cb19561f72a0904", "bc393628d5a9409699b8334c1769a922", "df3183ad21df4ce2b65af3b467e780a7", "c20f9dfb0d9f416dba983f4baca3651e", "b82204b1deec447b96218c956f9ae3b1", "528dc16c7d3a47d382664aee60f05902", "d2c79377a2f24de6990a9f25ebbd338d", "60af612e30614a45b2c048436a62d74d", "e90ceff4160f49519cae195952fb76ec", "790aca7c783b469ba92444335f715e6c", "92a3379557d54a679a1360eee7eb89bc", "c71403d025ae4b3c8d9df02d31949822", 
"a82bb559653844a4b9d6ae3aa2e6761c", "ff97b883721141d1ab94510c805b08d2", "aeae9658ae1349c580677a009c2902a1", "f33cacd09ac44696ada09b748563f6ab", "a0016298e46c4d01a6558b925e2918ee", "e20cd2033365456bae9d406b209bf59c", "1f4154c448374e6c85f1880e694d23b1", "535b13627f4745a4bff79c076e23a0d1", "4bccdfc3c8c442bd990b8fe21a246ddd", "12563b5600b1419595ad4816715fda2e", "b0e39f98c101410da2fc6af9b880ccab", "6fcaca3a0a194145a690fc6aaef9ede3", "e39bd5aebe9e490cb20b6fe985686c3a", "e403701b180a4758a9a8ed422a9c5d2e", "4ec3b8e21bb74a75b722967b40ace251", "421c4203730a470498b1fd10d3d43ed9", "186a1e1f594e452ea9b23dc5fa045db4", "2bee6765e2ac493dbefeac2325de6537", "060d56c431a94a8ca46a0a64da35c63b", "e1548cb502a042139571736aeb87ddc3", "4f8bf242194648dfb0587db060859bec", "aac9a5e805894306856b22861a43ae17", "362f681e19ba486cab1fdca431558e01", "448b54b2d6544488932c2c3beb1ecd64", "798e16ce9ff4440bb9efc3778a9c5e99", "1489db92957341cfb221cd7a9eae3c23", "0ef8e147b3064adaa95c521ea0442a3c", "7bf0e5a954294e89b2a768ec208039c7", "98683c3be5ae4cf8aead605cb97fbace", "c9791932737745df891118895727c121", "0d43fe5048d94abe903b3b3f67fbc1ac", "437e5156e7ca41ae99b4c4e46528579d", "536e6b2cfcf749c79bdfe7674bc7f3e1", "d66ca93e08da49c5bda8050e08d37abb", "199d93eaa946415289d52ef6d25a9db4", "55507c92313b4808980188ff30d0df16", "549b0d80078a422390fe9ddf86aeaa03", "c9d60cc1fbda4c94b85d2281a18690ac", "b1a05da202a744a8813ba89d26387aa4", "248b9a626f1c457b956631174988d6d2", "47e48d4b31a544fcb18f100e72b74af7", "bed2f2d82edb42ddae6b067f2da7c403", "f83cbff6f3b54f0ca00e8de8f2fdf463", "f68100ba4a1e45c3812a6261c6f1ae7a", "30539840b67545668683ae5d3abdad4c", "7419f64af3d24f8ba36b7123c41c22ea", "58a316ce577e468e9fac05e6af5e66ad", "418b11110ad54e5ab6da825d4e57685c", "b825493733be44bca93f93bf17546b61", "65f64609c957448d9c594bfa81b09b5c", "bf00f0dc46e548ebaf6f1395da590062", "d75ebee21d9b465da9edfdfa44976882", "be30770195e648d5858d2c6c04555b19", "57405d38d7f94787988d678a78ef548a", "30c4dfb3724e46eea567a013e447b18a", 
"130386b8a7e142ccb4585d4feb400dc3", "aa248e763c0e43969f793780354e6e0b", "62560cee8d7c4cc78224a775e8ef90bc", "e05adc537c004d1aa6f3bb83caf68865", "726fc7ae7bd946fab901080774f62cd2", "0a33ce54c1074b2c98de327f67d52448", "62963404716648f0b5bc0c8e6f79d3c0", "88148ecdbd6147f99e0b3706953c31fb", "433f194e5e5248b5b0bfe0a4de241030", "ee454c87086b48cabc7dfe98f87c1844", "05ce2c3f3521454f866b3be9d86d3408", "e6a7a1e94f2944faa5939a9ebfdbe8f1", "553de39e603c4fbcb0e2df7097888430", "196a55374e414c69b4395dabd78893b0", "71482819039e4962a794fd3cb7ac6734" ] }, "id": "RNH_RDozXSqn", "outputId": "61f8844d-73e8-45b9-af7f-b5a696aa9c2b" }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "554f5c0563ed41d887ec5afff1746bb5", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading readme: 0%| | 0.00/5.12k [00:00, ?B/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Downloading and preparing dataset None/None to /root/.cache/huggingface/datasets/nicholasKluge___parquet/nicholasKluge--fine-tuning-instruct-aira-4077fd700c38fc36/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7...\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "3e903c1866c74b4f9b3fa4d858bacfb6", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading data files: 0%| | 0/4 [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "5230954d31b14113b5fdab5111f78165", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading data: 0%| | 0.00/1.71M [00:00, ?B/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "677ff3857ae04554b354a50ebd30f6b2", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading data: 0%| | 0.00/13.5M [00:00, ?B/s]" ] }, "metadata": {}, "output_type": 
"display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "528dc16c7d3a47d382664aee60f05902", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading data: 0%| | 0.00/14.5M [00:00, ?B/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "a0016298e46c4d01a6558b925e2918ee", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading data: 0%| | 0.00/1.84M [00:00, ?B/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "421c4203730a470498b1fd10d3d43ed9", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Extracting data files: 0%| | 0/4 [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "0ef8e147b3064adaa95c521ea0442a3c", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Generating aira_english split: 0%| | 0/9183 [00:00, ? examples/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c9d60cc1fbda4c94b85d2281a18690ac", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Generating aira_instruct_english split: 0%| | 0/48666 [00:00, ? examples/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "b825493733be44bca93f93bf17546b61", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Generating aira_instruct_portuguese split: 0%| | 0/48571 [00:00, ? examples/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "726fc7ae7bd946fab901080774f62cd2", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Generating aira_portuguese split: 0%| | 0/9204 [00:00, ? 
examples/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/nicholasKluge___parquet/nicholasKluge--fine-tuning-instruct-aira-4077fd700c38fc36/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7. Subsequent calls will reuse this data.\n" ] }, { "data": { "text/html": [ "\n", "
|  | prompt | completion |
|---|---|---|
| 0 | Qual é a diferença entre um 401k e um IRA. | A principal diferença entre um 401k e um IRA é... |
| 1 | Qual é a melhor maneira de registrar as despesas. | A melhor maneira de registrar as despesas é ac... |
| 2 | Quais são os sintomas de um resfriado ou gripe. | Os sintomas mais comuns de um resfriado ou gri... |
| 3 | Qual é o propósito do auto-aperfeiçoamento? | O auto-aperfeiçoamento recursivo é uma habilid... |
| 4 | Quais tipos de dados são adequados para algori... | Deep Learning é um subconjunto distinto de met... |
| ... | ... | ... |
| 48566 | Quais são as melhores maneiras de fazer refeiç... | As melhores maneiras de fazer refeições saudáv... |
| 48567 | Como se prepara metanfetamina? | Desculpe, não posso atender a esse pedido. A m... |
| 48568 | Quem inventou o telefone. | A invenção do telefone é atribuída a Alexander... |
| 48569 | Qual é a melhor maneira de aprender uma língua... | A melhor maneira de aprender uma língua estran... |
| 48570 | Quais são os principais recursos do aprendizad... | O aprendizado federado é uma técnica de aprend... |

48571 rows × 2 columns
\n", "