{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zejqYqv5XXKN"
      },
      "source": [
        "# 🧪 Day 01 – Sentiment-Analysis & Zero-Shot Classification with Hugging Face 🤗\n",
        "\n",
        "This notebook contains all the code experiments for Day 1 of my [30 Days of GenAI](https://huggingface.co/Musno/30-days-of-genai) challenge.\n",
        "\n",
        "_For detailed commentary and discoveries, see 👉 [Day 1 Log](https://huggingface.co/Musno/30-days-of-genai/blob/main/logs/day1.md)_\n",
        "\n",
        "---\n",
        "\n",
        "## 📌 What’s Covered Today\n",
        "\n",
        "- Exploring the `sentiment-analysis` pipeline\n",
        "  - the same phrase gets **different confidence scores** in different languages\n",
        "- Exploring the `zero-shot-classification` pipeline\n",
        "- Comparing model behavior across:\n",
        "  - Arabic input + Arabic labels\n",
        "  - Arabic input + English labels\n",
        "  - English input + Arabic labels\n",
        "  - Mixed language labels\n",
        "- Observing language bias and label ordering\n",
        "\n",
        "---"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ZjmHyfTZIfdb"
      },
      "outputs": [],
      "source": [
        "from transformers import pipeline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cyyf-sdqmxkS"
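      },
      "source": [
        "### ✍️ Language & Sentiment\n",
        "\n",
        "The next cell runs the same phrase (\"I love you\") through the pipeline's default sentiment model in English, Arabic, dialectal Arabic, and French. The results highlight how these models:\n",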
        "- Are heavily biased toward English\n",
        "- Struggle with Arabic dialects (like Egyptian Arabic)\n",
        "- Might not have seen enough emotionally expressive Arabic data during training"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "qpSZ2SiUKGgV",
        "outputId": "406d85ad-c86e-4be2-c575-1d842668069e"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).\n",
            "Using a pipeline without specifying a model name and revision in production is not recommended.\n",
            "Device set to use cpu\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "[{'label': 'POSITIVE', 'score': 0.9998656511306763}]\n",
            "[{'label': 'POSITIVE', 'score': 0.5509597659111023}]\n",
            "[{'label': 'NEGATIVE', 'score': 0.5022428631782532}]\n",
            "[{'label': 'POSITIVE', 'score': 0.9394443035125732}]\n"
          ]
        }
      ],
      "source": [
        "classifier = pipeline(\"sentiment-analysis\")\n",
        "\n",
        "english = classifier(\"I love you\")\n",
        "arabic = classifier(\"أنا بحبك\")\n",
        "arabic_dialect = classifier(\"انابحبك اوي\")\n",
        "french = classifier(\"je t'aime\")\n",
        "\n",
        "print(english)\n",
        "print(arabic)\n",
        "print(arabic_dialect)\n",
        "print(french)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ZMOgRJC2nVp0"
      },
      "source": [
        "Would choosing a specific model, instead of the pipeline default, give us better results? Probably yes, but we will figure that out later in this exploration journey."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Lqc1ANwwYwZJ"
      },
      "source": [
        "### 🧪 Test 1: Arabic Input + Arabic Labels\n",
        "\n",
        "Testing how the model handles Arabic input when all labels are also Arabic.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "vKptToyTAdyv",
        "outputId": "b5e59b45-5f1c-4323-af34-d048efa44e4f"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).\n",
            "Using a pipeline without specifying a model name and revision in production is not recommended.\n",
            "Device set to use cpu\n"
          ]
        },
        {
          "data": {
            "text/plain": [
              "{'sequence': 'أنا أحب تعلم الذكاء الاصطناعي',\n",
              " 'labels': ['تعليم', 'طعام', 'رياضة'],\n",
              " 'scores': [0.754145085811615, 0.20169343054294586, 0.04416144639253616]}"
            ]
          },
          "execution_count": 91,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "classifier = pipeline(\"zero-shot-classification\")\n",
        "classifier(\n",
        "    \"أنا أحب تعلم الذكاء الاصطناعي\",\n",
        "    candidate_labels=[\"تعليم\", \"رياضة\", \"طعام\"]\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aPTGodEkY-Ff"
      },
      "source": [
        "The raw result can look wrong because Arabic is written right-to-left (RTL), so the printed list of labels may appear visually reversed. The actual top label is تعليم (education), which has the highest confidence, so the model's result is correct. The next cell confirms this by printing each label next to its score."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "bfYwofzkZzq8",
        "outputId": "87c817ca-c39e-42fb-8066-60bf036f84dc"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "تعليم: 0.754\n",
            "طعام: 0.202\n",
            "رياضة: 0.044\n"
          ]
        }
      ],
      "source": [
        "output = classifier(\n",
        "    \"أنا أحب تعلم الذكاء الاصطناعي\",\n",
        "    candidate_labels=[\"تعليم\", \"رياضة\", \"طعام\"]\n",
        ")\n",
        "for label, score in zip(output['labels'], output['scores']):\n",
        "    print(f\"{label}: {score:.3f}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qzbuleH_aA90"
      },
      "source": [
        "### 🧪 Test 2: Arabic Input + English Labels\n",
        "Same sentence as above, but now labels are in English. Checking how this affects accuracy and confidence.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "omQ5DCJMD8zp",
        "outputId": "2e1c2cac-393c-4579-ca03-dff1f97d8fa2"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'sequence': 'أنا أحب تعلم الذكاء الاصطناعي',\n",
              " 'labels': ['education', 'sports', 'politics'],\n",
              " 'scores': [0.49976322054862976, 0.28726592659950256, 0.21297085285186768]}"
            ]
          },
          "execution_count": 94,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "classifier(\n",
        "    \"أنا أحب تعلم الذكاء الاصطناعي\",\n",
        "    candidate_labels=[\"education\", \"sports\", \"politics\"]\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RSjDItt5a7t_"
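      },
      "source": [
        "The top label is still correct, but the confidence is lower (about 0.50, versus 0.75 with Arabic labels). Note that the label set is not identical either (politics replaces food), so the comparison is only approximate. The output is easier to interpret, though, because there is no RTL formatting confusion."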
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ajm4l1RAbEXc"
      },
      "source": [
        "### 🧪 Test 3: English Input and Labels = Most Accurate (as expected)\n",
        "\n",
        "When both the text and the labels are in English, the model picks the correct label with the highest confidence of all the tests (about 0.76):"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "q4HHJjMHEY03",
        "outputId": "0e47a0c2-ef5b-4f3c-c8ce-834efdf7405f"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'sequence': 'I love learning AI',\n",
              " 'labels': ['education', 'sports', 'food'],\n",
              " 'scores': [0.7564858198165894, 0.12874628603458405, 0.11476800590753555]}"
            ]
          },
          "execution_count": 95,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "classifier(\n",
        "    \"I love learning AI\",\n",
        "    candidate_labels=[\"education\", \"sports\", \"food\"]\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "e6uUPRXWbZFe"
      },
      "source": [
        "### 🧪 Test 4: Arabic Labels with English Input = Inaccurate & Low Confidence"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Bo3CpyhbEdbc",
        "outputId": "60fe6502-acae-4e85-dd75-49ae02970f9f"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'sequence': 'I love learning AI',\n",
              " 'labels': ['طعام', 'تعليم', 'رياضة'],\n",
              " 'scores': [0.37267985939979553, 0.33104342222213745, 0.296276718378067]}"
            ]
          },
          "execution_count": 96,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "classifier(\n",
        "    \"I love learning AI\",\n",
        "    candidate_labels=[\"طعام\", \"تعليم\", \"رياضة\"]\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "pS4oPUzybkH6"
      },
      "source": [
        "The output is confusing here: تعليم (education) appears in the middle of the list, which suggests the model did not pick the correct label. To confirm that, let's print each label next to its score."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "bGhzDDv4b6rD",
        "outputId": "aa7f0788-075f-4491-de29-fe70a772e41b"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "طعام: 0.373\n",
            "تعليم: 0.331\n",
            "رياضة: 0.296\n"
          ]
        }
      ],
      "source": [
        "output = classifier(\"I love learning AI\", candidate_labels=[\"طعام\", \"تعليم\", \"رياضة\"])\n",
        "for label, score in zip(output['labels'], output['scores']):\n",
        "    print(f\"{label}: {score:.3f}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XS3XnbRvcCLC"
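      },
      "source": [
        "It picked طعام (food). I don't know why yet; if I find out later, I will update this. The three scores are all close to 1/3, so the model barely prefers one label over another."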
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "oAo5KT8zcdZw"
      },
      "source": [
        "### 🧪 Test 5: Mixed Labels with Arabic Input = A Funny Twist\n",
        "\n",
        "Using a mix of Arabic and English labels:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "pomecKh8FQCY",
        "outputId": "ae18001c-da1d-4bba-edd9-982201fb06ad"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'sequence': 'أنا أحب تعلم الذكاء الاصطناعي',\n",
              " 'labels': ['طعام', 'رياضة', 'education'],\n",
              " 'scores': [0.7335377335548401, 0.16061054170131683, 0.10585174709558487]}"
            ]
          },
          "execution_count": 105,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "classifier(\n",
        "    \"أنا أحب تعلم الذكاء الاصطناعي\",\n",
        "    candidate_labels=[\"education\", \"رياضة\", \"طعام\"]\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yKv2-idigfII"
      },
      "source": [
        "The raw output looks similar to the all-Arabic case; let's format it to make the ranking clear."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "_8ez4mgkcwpR",
        "outputId": "20721ea1-bd2e-4775-f591-713b2e3dda9b"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "1. طعام: 0.734\n",
            "2. رياضة: 0.161\n",
            "3. education: 0.106\n"
          ]
        }
      ],
      "source": [
        "output = classifier(\"أنا أحب تعلم الذكاء الاصطناعي\", candidate_labels=[\"education\", \"رياضة\", \"طعام\"])\n",
        "sorted_results = sorted(zip(output['labels'], output['scores']), key=lambda x: x[1], reverse=True)\n",
        "for i, (label, score) in enumerate(sorted_results, 1):\n",
        "    print(f\"{i}. {label}: {score:.3f}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MPYWjgNxhCEn"
      },
      "source": [
        "I didn't expect that, to be honest 🧐. The formatted output shows that the model really ranked طعام (food) first with 0.734 and education last with 0.106. The pipeline returns labels already sorted by score, with each score at the same position as its label, so this is a wrong prediction and not an RTL display problem. Why? I'm not sure, but my guess is that the default model (facebook/bart-large-mnli) was trained on English data and does not handle Arabic text or labels reliably, so mixing languages in the label set makes its scores hard to trust. I wonder how it works with other languages 👀"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}