Updated log link
- README.md +1 -1
- logs/day1.md +1 -1
- notebooks/day1.ipynb +162 -162
README.md
CHANGED
@@ -22,7 +22,7 @@ To deepen my understanding of Gen AI, complete the Hugging Face course, build re
|
|
22 |
|
23 |
| Day | Topic | Notebook | Log |
|
24 |
|-----|-------|----------|-----|
|
25 |
-
| 1 | First HF pipelines | [Colab](https://colab.research.google.com/drive/1ysW0sQq01mI9o5uVyaLMM5oCT3pDI41e?usp=sharing) / [Repo](notebooks/day1.ipynb) | [Day 1 Log](
|
26 |
| 2 | ... coming soon | - | - |
|
27 |
|
28 |
## 🔧 Tech Stack
|
|
|
22 |
|
23 |
| Day | Topic | Notebook | Log |
|
24 |
|-----|-------|----------|-----|
|
25 |
+
| 1 | First HF pipelines | [Colab](https://colab.research.google.com/drive/1ysW0sQq01mI9o5uVyaLMM5oCT3pDI41e?usp=sharing) / [Repo](notebooks/day1.ipynb) | [Day 1 Log](https://huggingface.co/Musno/30-days-of-genai/blob/main/logs/day1.md) |
|
26 |
| 2 | ... coming soon | - | - |
|
27 |
|
28 |
## 🔧 Tech Stack
|
logs/day1.md
CHANGED
@@ -1,4 +1,4 @@
|
|
1 |
-
# 📘 Day 1: Sentiment & Zero-Shot
|
2 |
|
3 |
---
|
4 |
|
|
|
1 |
+
# 📘 Day 1: Sentiment & Zero-Shot Classification for Arabic Text
|
2 |
|
3 |
---
|
4 |
|
notebooks/day1.ipynb
CHANGED
@@ -1,27 +1,16 @@
|
|
1 |
{
|
2 |
-
"nbformat": 4,
|
3 |
-
"nbformat_minor": 0,
|
4 |
-
"metadata": {
|
5 |
-
"colab": {
|
6 |
-
"provenance": []
|
7 |
-
},
|
8 |
-
"kernelspec": {
|
9 |
-
"name": "python3",
|
10 |
-
"display_name": "Python 3"
|
11 |
-
},
|
12 |
-
"language_info": {
|
13 |
-
"name": "python"
|
14 |
-
}
|
15 |
-
},
|
16 |
"cells": [
|
17 |
{
|
18 |
"cell_type": "markdown",
|
19 |
"source": [
|
20 |
"# 🧪 Day 01 – Sentiment-Analysis & Zero-Shot Classification with Hugging Face 🤗\n",
|
21 |
"\n",
|
22 |
-
"This notebook contains all the code experiments for Day 1 of my [30 Days of GenAI](https://
|
23 |
"\n",
|
24 |
-
"_For detailed commentary and discoveries, see 👉 [Day 1 Log](logs/day1.md)_\n",
|
25 |
"\n",
|
26 |
"---\n",
|
27 |
"\n",
|
@@ -38,10 +27,7 @@
|
|
38 |
"- Observing language bias and label ordering\n",
|
39 |
"\n",
|
40 |
"---"
|
41 |
-
]
|
42 |
-
"metadata": {
|
43 |
-
"id": "zejqYqv5XXKN"
|
44 |
-
}
|
45 |
},
|
46 |
{
|
47 |
"cell_type": "code",
|
@@ -56,32 +42,20 @@
|
|
56 |
},
|
57 |
{
|
58 |
"cell_type": "markdown",
|
59 |
"source": [
|
60 |
"### ✍️ Language & Sentiment:\n",
|
61 |
" This highlights how these models:\n",
|
62 |
"- Are heavily biased toward English\n",
|
63 |
"- Struggle with Arabic dialects (like Egyptian Arabic)\n",
|
64 |
"- Might not have seen enough emotionally expressive Arabic data during training"
|
65 |
-
]
|
66 |
-
"metadata": {
|
67 |
-
"id": "cyyf-sdqmxkS"
|
68 |
-
}
|
69 |
},
|
70 |
{
|
71 |
"cell_type": "code",
|
72 |
-
"
|
73 |
-
"classifier = pipeline(\"sentiment-analysis\")\n",
|
74 |
-
"\n",
|
75 |
-
"english = classifier(\"I love you\")\n",
|
76 |
-
"arabic = classifier(\"أنا بحبك\")\n",
|
77 |
-
"arabic_dialect = classifier(\"انابحبك اوي\")\n",
|
78 |
-
"french = classifier(\"je t'aime\")\n",
|
79 |
-
"\n",
|
80 |
-
"print(english)\n",
|
81 |
-
"print(arabic)\n",
|
82 |
-
"print(arabic_dialect)\n",
|
83 |
-
"print(french)"
|
84 |
-
],
|
85 |
"metadata": {
|
86 |
"colab": {
|
87 |
"base_uri": "https://localhost:8080/"
|
@@ -89,11 +63,10 @@
|
|
89 |
"id": "qpSZ2SiUKGgV",
|
90 |
"outputId": "406d85ad-c86e-4be2-c575-1d842668069e"
|
91 |
},
|
92 |
-
"execution_count": null,
|
93 |
"outputs": [
|
94 |
{
|
95 |
-
"output_type": "stream",
|
96 |
"name": "stderr",
|
|
|
97 |
"text": [
|
98 |
"No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).\n",
|
99 |
"Using a pipeline without specifying a model name and revision in production is not recommended.\n",
|
@@ -101,8 +74,8 @@
|
|
101 |
]
|
102 |
},
|
103 |
{
|
104 |
-
"output_type": "stream",
|
105 |
"name": "stdout",
|
|
|
106 |
"text": [
|
107 |
"[{'label': 'POSITIVE', 'score': 0.9998656511306763}]\n",
|
108 |
"[{'label': 'POSITIVE', 'score': 0.5509597659111023}]\n",
|
@@ -110,37 +83,44 @@
|
|
110 |
"[{'label': 'POSITIVE', 'score': 0.9394443035125732}]\n"
|
111 |
]
|
112 |
}
|
113 |
]
|
114 |
},
|
115 |
{
|
116 |
"cell_type": "markdown",
|
117 |
-
"source": [
|
118 |
-
"If we use a specific model would that give us better result? Probably yes but we will figure this out later during our exploration journey."
|
119 |
-
],
|
120 |
"metadata": {
|
121 |
"id": "ZMOgRJC2nVp0"
|
122 |
-
}
|
123 |
},
|
124 |
{
|
125 |
"cell_type": "markdown",
|
126 |
"source": [
|
127 |
"###🧪 Test 1: Arabic Input + Arabic Labels\n",
|
128 |
"\n",
|
129 |
"Testing how the model handles Arabic input when all labels are also Arabic.\n"
|
130 |
-
]
|
131 |
-
"metadata": {
|
132 |
-
"id": "Lqc1ANwwYwZJ"
|
133 |
-
}
|
134 |
},
|
135 |
{
|
136 |
"cell_type": "code",
|
137 |
-
"
|
138 |
-
"classifier = pipeline(\"zero-shot-classification\")\n",
|
139 |
-
"classifier(\n",
|
140 |
-
" \"أنا أحب تعلم الذكاء الاصطناعي\",\n",
|
141 |
-
" candidate_labels=[\"تعليم\", \"رياضة\", \"طعام\"]\n",
|
142 |
-
")"
|
143 |
-
],
|
144 |
"metadata": {
|
145 |
"colab": {
|
146 |
"base_uri": "https://localhost:8080/"
|
@@ -148,11 +128,10 @@
|
|
148 |
"id": "vKptToyTAdyv",
|
149 |
"outputId": "b5e59b45-5f1c-4323-af34-d048efa44e4f"
|
150 |
},
|
151 |
-
"execution_count": null,
|
152 |
"outputs": [
|
153 |
{
|
154 |
-
"output_type": "stream",
|
155 |
"name": "stderr",
|
|
|
156 |
"text": [
|
157 |
"No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).\n",
|
158 |
"Using a pipeline without specifying a model name and revision in production is not recommended.\n",
|
@@ -160,7 +139,6 @@
|
|
160 |
]
|
161 |
},
|
162 |
{
|
163 |
-
"output_type": "execute_result",
|
164 |
"data": {
|
165 |
"text/plain": [
|
166 |
"{'sequence': 'أنا أحب تعلم الذكاء الاصطناعي',\n",
|
@@ -168,30 +146,31 @@
|
|
168 |
" 'scores': [0.754145085811615, 0.20169343054294586, 0.04416144639253616]}"
|
169 |
]
|
170 |
},
|
|
|
171 |
"metadata": {},
|
172 |
-
"
|
173 |
}
|
174 |
]
|
175 |
},
|
176 |
{
|
177 |
"cell_type": "markdown",
|
178 |
-
"source": [
|
179 |
-
"Results looks incorrect because of RTL Arabic writing. Because Arabic is right-to-left, the order of the printed labels may be visually reversed. The actual top label is تعليم (education) with the highest confidence. So the model results are correct and to confirm that see the following cell output"
|
180 |
-
],
|
181 |
"metadata": {
|
182 |
"id": "aPTGodEkY-Ff"
|
183 |
-
}
|
184 |
},
|
185 |
{
|
186 |
"cell_type": "code",
|
187 |
-
"
|
188 |
-
"output = classifier(\n",
|
189 |
-
" \"أنا أحب تعلم الذكاء الاصطناعي\",\n",
|
190 |
-
" candidate_labels=[\"تعليم\", \"رياضة\", \"طعام\"]\n",
|
191 |
-
")\n",
|
192 |
-
"for label, score in zip(output['labels'], output['scores']):\n",
|
193 |
-
" print(f\"{label}: {score:.3f}\")"
|
194 |
-
],
|
195 |
"metadata": {
|
196 |
"colab": {
|
197 |
"base_uri": "https://localhost:8080/"
|
@@ -199,37 +178,39 @@
|
|
199 |
"id": "bfYwofzkZzq8",
|
200 |
"outputId": "87c817ca-c39e-42fb-8066-60bf036f84dc"
|
201 |
},
|
202 |
-
"execution_count": null,
|
203 |
"outputs": [
|
204 |
{
|
205 |
-
"output_type": "stream",
|
206 |
"name": "stdout",
|
|
|
207 |
"text": [
|
208 |
"تعليم: 0.754\n",
|
209 |
"طعام: 0.202\n",
|
210 |
"رياضة: 0.044\n"
|
211 |
]
|
212 |
}
|
213 |
]
|
214 |
},
|
215 |
{
|
216 |
"cell_type": "markdown",
|
217 |
"source": [
|
218 |
"###🧪 Test 2: Arabic Input + English Labels\n",
|
219 |
"Same sentence as above, but now labels are in English. Checking how this affects accuracy and confidence.\n"
|
220 |
-
]
|
221 |
-
"metadata": {
|
222 |
-
"id": "qzbuleH_aA90"
|
223 |
-
}
|
224 |
},
|
225 |
{
|
226 |
"cell_type": "code",
|
227 |
-
"
|
228 |
-
"classifier(\n",
|
229 |
-
" \"أنا أحب تعلم الذكاء الاصطناعي\",\n",
|
230 |
-
" candidate_labels=[\"education\", \"sports\", \"politics\"]\n",
|
231 |
-
")"
|
232 |
-
],
|
233 |
"metadata": {
|
234 |
"colab": {
|
235 |
"base_uri": "https://localhost:8080/"
|
@@ -237,10 +218,8 @@
|
|
237 |
"id": "omQ5DCJMD8zp",
|
238 |
"outputId": "2e1c2cac-393c-4579-ca03-dff1f97d8fa2"
|
239 |
},
|
240 |
-
"execution_count": null,
|
241 |
"outputs": [
|
242 |
{
|
243 |
-
"output_type": "execute_result",
|
244 |
"data": {
|
245 |
"text/plain": [
|
246 |
"{'sequence': 'أنا أحب تعلم الذكاء الاصطناعي',\n",
|
@@ -248,39 +227,41 @@
|
|
248 |
" 'scores': [0.49976322054862976, 0.28726592659950256, 0.21297085285186768]}"
|
249 |
]
|
250 |
},
|
|
|
251 |
"metadata": {},
|
252 |
-
"
|
253 |
}
|
254 |
]
|
255 |
},
|
256 |
{
|
257 |
"cell_type": "markdown",
|
258 |
-
"source": [
|
259 |
-
"This result is less accurate (lower confidence), but easier to interpret — no RTL formatting confusion."
|
260 |
-
],
|
261 |
"metadata": {
|
262 |
"id": "RSjDItt5a7t_"
|
263 |
-
}
|
264 |
},
|
265 |
{
|
266 |
"cell_type": "markdown",
|
267 |
"source": [
|
268 |
"###🧪 Test 3: English Input and Labels = Most Accurate (as expected)\n",
|
269 |
"\n",
|
270 |
"When both the text and labels are in English, the model performs better:"
|
271 |
-
]
|
272 |
-
"metadata": {
|
273 |
-
"id": "ajm4l1RAbEXc"
|
274 |
-
}
|
275 |
},
|
276 |
{
|
277 |
"cell_type": "code",
|
278 |
-
"
|
279 |
-
"classifier(\n",
|
280 |
-
" \"I love learning AI\",\n",
|
281 |
-
" candidate_labels=[\"education\", \"sports\", \"food\"]\n",
|
282 |
-
")"
|
283 |
-
],
|
284 |
"metadata": {
|
285 |
"colab": {
|
286 |
"base_uri": "https://localhost:8080/"
|
@@ -288,10 +269,8 @@
|
|
288 |
"id": "q4HHJjMHEY03",
|
289 |
"outputId": "0e47a0c2-ef5b-4f3c-c8ce-834efdf7405f"
|
290 |
},
|
291 |
-
"execution_count": null,
|
292 |
"outputs": [
|
293 |
{
|
294 |
-
"output_type": "execute_result",
|
295 |
"data": {
|
296 |
"text/plain": [
|
297 |
"{'sequence': 'I love learning AI',\n",
|
@@ -299,28 +278,30 @@
|
|
299 |
" 'scores': [0.7564858198165894, 0.12874628603458405, 0.11476800590753555]}"
|
300 |
]
|
301 |
},
|
|
|
302 |
"metadata": {},
|
303 |
-
"
|
304 |
}
|
305 |
]
|
306 |
},
|
307 |
{
|
308 |
"cell_type": "markdown",
|
309 |
-
"source": [
|
310 |
-
"###🧪 Test 4: Arabic Labels with English Input = Inaccurate & Low Confidence"
|
311 |
-
],
|
312 |
"metadata": {
|
313 |
"id": "e6uUPRXWbZFe"
|
314 |
-
}
|
315 |
},
|
316 |
{
|
317 |
"cell_type": "code",
|
318 |
-
"
|
319 |
-
"classifier(\n",
|
320 |
-
" \"I love learning AI\",\n",
|
321 |
-
" candidate_labels=[\"طعام\", \"تعليم\", \"رياضة\"]\n",
|
322 |
-
")"
|
323 |
-
],
|
324 |
"metadata": {
|
325 |
"colab": {
|
326 |
"base_uri": "https://localhost:8080/"
|
@@ -328,10 +309,8 @@
|
|
328 |
"id": "Bo3CpyhbEdbc",
|
329 |
"outputId": "60fe6502-acae-4e85-dd75-49ae02970f9f"
|
330 |
},
|
331 |
-
"execution_count": null,
|
332 |
"outputs": [
|
333 |
{
|
334 |
-
"output_type": "execute_result",
|
335 |
"data": {
|
336 |
"text/plain": [
|
337 |
"{'sequence': 'I love learning AI',\n",
|
@@ -339,27 +318,30 @@
|
|
339 |
" 'scores': [0.37267985939979553, 0.33104342222213745, 0.296276718378067]}"
|
340 |
]
|
341 |
},
|
|
|
342 |
"metadata": {},
|
343 |
-
"
|
344 |
}
|
345 |
]
|
346 |
},
|
347 |
{
|
348 |
"cell_type": "markdown",
|
349 |
-
"source": [
|
350 |
-
"The output is really confusing here because (تعليم) is in the middle that means the model didn't pick the correct word. To confirm that let's try the formatting output."
|
351 |
-
],
|
352 |
"metadata": {
|
353 |
"id": "pS4oPUzybkH6"
|
354 |
-
}
|
355 |
},
|
356 |
{
|
357 |
"cell_type": "code",
|
358 |
-
"
|
359 |
-
"output = classifier(\"I love learning AI\", candidate_labels=[\"طعام\", \"تعليم\", \"رياضة\"])\n",
|
360 |
-
"for label, score in zip(output['labels'], output['scores']):\n",
|
361 |
-
" print(f\"{label}: {score:.3f}\")"
|
362 |
-
],
|
363 |
"metadata": {
|
364 |
"colab": {
|
365 |
"base_uri": "https://localhost:8080/"
|
@@ -367,47 +349,46 @@
|
|
367 |
"id": "bGhzDDv4b6rD",
|
368 |
"outputId": "aa7f0788-075f-4491-de29-fe70a772e41b"
|
369 |
},
|
370 |
-
"execution_count": null,
|
371 |
"outputs": [
|
372 |
{
|
373 |
-
"output_type": "stream",
|
374 |
"name": "stdout",
|
|
|
375 |
"text": [
|
376 |
"طعام: 0.373\n",
|
377 |
"تعليم: 0.331\n",
|
378 |
"رياضة: 0.296\n"
|
379 |
]
|
380 |
}
|
381 |
]
|
382 |
},
|
383 |
{
|
384 |
"cell_type": "markdown",
|
385 |
-
"source": [
|
386 |
-
"It picked food (طعام) I dunno why, if I find out later I will update this."
|
387 |
-
],
|
388 |
"metadata": {
|
389 |
"id": "XS3XnbRvcCLC"
|
390 |
-
}
|
391 |
},
|
392 |
{
|
393 |
"cell_type": "markdown",
|
394 |
"source": [
|
395 |
"###🧪 Test 5: Mixed Labels with Arabic Input = A Funny Twist\n",
|
396 |
"\n",
|
397 |
"Using a mix of Arabic and English labels:"
|
398 |
-
]
|
399 |
-
"metadata": {
|
400 |
-
"id": "oAo5KT8zcdZw"
|
401 |
-
}
|
402 |
},
|
403 |
{
|
404 |
"cell_type": "code",
|
405 |
-
"
|
406 |
-
"classifier(\n",
|
407 |
-
" \"أنا أحب تعلم الذكاء الاصطناعي\",\n",
|
408 |
-
" candidate_labels=[\"education\", \"رياضة\", \"طعام\"]\n",
|
409 |
-
")"
|
410 |
-
],
|
411 |
"metadata": {
|
412 |
"colab": {
|
413 |
"base_uri": "https://localhost:8080/"
|
@@ -415,10 +396,8 @@
|
|
415 |
"id": "pomecKh8FQCY",
|
416 |
"outputId": "ae18001c-da1d-4bba-edd9-982201fb06ad"
|
417 |
},
|
418 |
-
"execution_count": null,
|
419 |
"outputs": [
|
420 |
{
|
421 |
-
"output_type": "execute_result",
|
422 |
"data": {
|
423 |
"text/plain": [
|
424 |
"{'sequence': 'أنا أحب تعلم الذكاء الاصطناعي',\n",
|
@@ -426,28 +405,30 @@
|
|
426 |
" 'scores': [0.7335377335548401, 0.16061054170131683, 0.10585174709558487]}"
|
427 |
]
|
428 |
},
|
|
|
429 |
"metadata": {},
|
430 |
-
"
|
431 |
}
|
432 |
]
|
433 |
},
|
434 |
{
|
435 |
"cell_type": "markdown",
|
436 |
-
"source": [
|
437 |
-
"The output resembles using Arabic labels; let's use formatting for clarification."
|
438 |
-
],
|
439 |
"metadata": {
|
440 |
"id": "yKv2-idigfII"
|
441 |
-
}
|
442 |
},
|
443 |
{
|
444 |
"cell_type": "code",
|
445 |
-
"
|
446 |
-
"output = classifier(\"أنا أحب تعلم الذكاء الاصطناعي\", candidate_labels=[\"education\", \"رياضة\", \"طعام\"])\n",
|
447 |
-
"sorted_results = sorted(zip(output['labels'], output['scores']), key=lambda x: x[1], reverse=True)\n",
|
448 |
-
"for i, (label, score) in enumerate(sorted_results, 1):\n",
|
449 |
-
" print(f\"{i}. {label}: {score:.3f}\")"
|
450 |
-
],
|
451 |
"metadata": {
|
452 |
"colab": {
|
453 |
"base_uri": "https://localhost:8080/"
|
@@ -455,27 +436,46 @@
|
|
455 |
"id": "_8ez4mgkcwpR",
|
456 |
"outputId": "20721ea1-bd2e-4775-f591-713b2e3dda9b"
|
457 |
},
|
458 |
-
"execution_count": null,
|
459 |
"outputs": [
|
460 |
{
|
461 |
-
"output_type": "stream",
|
462 |
"name": "stdout",
|
|
|
463 |
"text": [
|
464 |
"1. طعام: 0.734\n",
|
465 |
"2. رياضة: 0.161\n",
|
466 |
"3. education: 0.106\n"
|
467 |
]
|
468 |
}
|
469 |
]
|
470 |
},
|
471 |
{
|
472 |
"cell_type": "markdown",
|
473 |
-
"source": [
|
474 |
-
"I didn't expect that tbh 🧐. I'm sure the model got the correct result, but the results will always look confusing in this case specially if you format the output. Why? I'm not sure, but I guess education wasn't counted because it's not Arabic word, and it started counting from the next Arabic word (طعام). So our model knows the right answer but it doesn't know how to represent it in the correct way. I wonder how does it work with other languages 👀"
|
475 |
-
],
|
476 |
"metadata": {
|
477 |
"id": "MPYWjgNxhCEn"
|
478 |
-
}
|
479 |
}
|
480 |
-
]
|
481 |
-
|
1 |
{
|
2 |
"cells": [
|
3 |
{
|
4 |
"cell_type": "markdown",
|
5 |
+
"metadata": {
|
6 |
+
"id": "zejqYqv5XXKN"
|
7 |
+
},
|
8 |
"source": [
|
9 |
"# 🧪 Day 01 – Sentiment-Analysis & Zero-Shot Classification with Hugging Face 🤗\n",
|
10 |
"\n",
|
11 |
+
"This notebook contains all the code experiments for Day 1 of my [30 Days of GenAI](https://huggingface.co/Musno/30-days-of-genai) challenge.\n",
|
12 |
"\n",
|
13 |
+
"_For detailed commentary and discoveries, see 👉 [Day 1 Log](https://huggingface.co/Musno/30-days-of-genai/blob/main/logs/day1.md)_\n",
|
14 |
"\n",
|
15 |
"---\n",
|
16 |
"\n",
|
|
|
27 |
"- Observing language bias and label ordering\n",
|
28 |
"\n",
|
29 |
"---"
|
30 |
+
]
|
31 |
},
|
32 |
{
|
33 |
"cell_type": "code",
|
|
|
42 |
},
|
43 |
{
|
44 |
"cell_type": "markdown",
|
45 |
+
"metadata": {
|
46 |
+
"id": "cyyf-sdqmxkS"
|
47 |
+
},
|
48 |
"source": [
|
49 |
"### ✍️ Language & Sentiment:\n",
|
50 |
" This highlights how these models:\n",
|
51 |
"- Are heavily biased toward English\n",
|
52 |
"- Struggle with Arabic dialects (like Egyptian Arabic)\n",
|
53 |
"- Might not have seen enough emotionally expressive Arabic data during training"
|
54 |
+
]
|
55 |
},
|
56 |
{
|
57 |
"cell_type": "code",
|
58 |
+
"execution_count": null,
|
59 |
"metadata": {
|
60 |
"colab": {
|
61 |
"base_uri": "https://localhost:8080/"
|
|
|
63 |
"id": "qpSZ2SiUKGgV",
|
64 |
"outputId": "406d85ad-c86e-4be2-c575-1d842668069e"
|
65 |
},
|
|
|
66 |
"outputs": [
|
67 |
{
|
|
|
68 |
"name": "stderr",
|
69 |
+
"output_type": "stream",
|
70 |
"text": [
|
71 |
"No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).\n",
|
72 |
"Using a pipeline without specifying a model name and revision in production is not recommended.\n",
|
|
|
74 |
]
|
75 |
},
|
76 |
{
|
|
|
77 |
"name": "stdout",
|
78 |
+
"output_type": "stream",
|
79 |
"text": [
|
80 |
"[{'label': 'POSITIVE', 'score': 0.9998656511306763}]\n",
|
81 |
"[{'label': 'POSITIVE', 'score': 0.5509597659111023}]\n",
|
|
|
83 |
"[{'label': 'POSITIVE', 'score': 0.9394443035125732}]\n"
|
84 |
]
|
85 |
}
|
86 |
+
],
|
87 |
+
"source": [
|
88 |
+
"classifier = pipeline(\"sentiment-analysis\")\n",
|
89 |
+
"\n",
|
90 |
+
"english = classifier(\"I love you\")\n",
|
91 |
+
"arabic = classifier(\"أنا بحبك\")\n",
|
92 |
+
"arabic_dialect = classifier(\"انابحبك اوي\")\n",
|
93 |
+
"french = classifier(\"je t'aime\")\n",
|
94 |
+
"\n",
|
95 |
+
"print(english)\n",
|
96 |
+
"print(arabic)\n",
|
97 |
+
"print(arabic_dialect)\n",
|
98 |
+
"print(french)"
|
99 |
]
|
100 |
},
|
101 |
{
|
102 |
"cell_type": "markdown",
|
103 |
"metadata": {
|
104 |
"id": "ZMOgRJC2nVp0"
|
105 |
+
},
|
106 |
+
"source": [
|
107 |
+
"If we use a specific model would that give us better result? Probably yes but we will figure this out later during our exploration journey."
|
108 |
+
]
|
109 |
},
|
110 |
{
|
111 |
"cell_type": "markdown",
|
112 |
+
"metadata": {
|
113 |
+
"id": "Lqc1ANwwYwZJ"
|
114 |
+
},
|
115 |
"source": [
|
116 |
"###🧪 Test 1: Arabic Input + Arabic Labels\n",
|
117 |
"\n",
|
118 |
"Testing how the model handles Arabic input when all labels are also Arabic.\n"
|
119 |
+
]
|
120 |
},
|
121 |
{
|
122 |
"cell_type": "code",
|
123 |
+
"execution_count": null,
|
124 |
"metadata": {
|
125 |
"colab": {
|
126 |
"base_uri": "https://localhost:8080/"
|
|
|
128 |
"id": "vKptToyTAdyv",
|
129 |
"outputId": "b5e59b45-5f1c-4323-af34-d048efa44e4f"
|
130 |
},
|
|
|
131 |
"outputs": [
|
132 |
{
|
|
|
133 |
"name": "stderr",
|
134 |
+
"output_type": "stream",
|
135 |
"text": [
|
136 |
"No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).\n",
|
137 |
"Using a pipeline without specifying a model name and revision in production is not recommended.\n",
|
|
|
139 |
]
|
140 |
},
|
141 |
{
|
|
|
142 |
"data": {
|
143 |
"text/plain": [
|
144 |
"{'sequence': 'أنا أحب تعلم الذكاء الاصطناعي',\n",
|
|
|
146 |
" 'scores': [0.754145085811615, 0.20169343054294586, 0.04416144639253616]}"
|
147 |
]
|
148 |
},
|
149 |
+
"execution_count": 91,
|
150 |
"metadata": {},
|
151 |
+
"output_type": "execute_result"
|
152 |
}
|
153 |
+
],
|
154 |
+
"source": [
|
155 |
+
"classifier = pipeline(\"zero-shot-classification\")\n",
|
156 |
+
"classifier(\n",
|
157 |
+
" \"أنا أحب تعلم الذكاء الاصطناعي\",\n",
|
158 |
+
" candidate_labels=[\"تعليم\", \"رياضة\", \"طعام\"]\n",
|
159 |
+
")"
|
160 |
]
|
161 |
},
|
162 |
{
|
163 |
"cell_type": "markdown",
|
164 |
"metadata": {
|
165 |
"id": "aPTGodEkY-Ff"
|
166 |
+
},
|
167 |
+
"source": [
|
168 |
+
"Results looks incorrect because of RTL Arabic writing. Because Arabic is right-to-left, the order of the printed labels may be visually reversed. The actual top label is تعليم (education) with the highest confidence. So the model results are correct and to confirm that see the following cell output"
|
169 |
+
]
|
170 |
},
|
171 |
{
|
172 |
"cell_type": "code",
|
173 |
+
"execution_count": null,
|
174 |
"metadata": {
|
175 |
"colab": {
|
176 |
"base_uri": "https://localhost:8080/"
|
|
|
178 |
"id": "bfYwofzkZzq8",
|
179 |
"outputId": "87c817ca-c39e-42fb-8066-60bf036f84dc"
|
180 |
},
|
|
|
181 |
"outputs": [
|
182 |
{
|
|
|
183 |
"name": "stdout",
|
184 |
+
"output_type": "stream",
|
185 |
"text": [
|
186 |
"تعليم: 0.754\n",
|
187 |
"طعام: 0.202\n",
|
188 |
"رياضة: 0.044\n"
|
189 |
]
|
190 |
}
|
191 |
+
],
|
192 |
+
"source": [
|
193 |
+
"output = classifier(\n",
|
194 |
+
" \"أنا أحب تعلم الذكاء الاصطناعي\",\n",
|
195 |
+
" candidate_labels=[\"تعليم\", \"رياضة\", \"طعام\"]\n",
|
196 |
+
")\n",
|
197 |
+
"for label, score in zip(output['labels'], output['scores']):\n",
|
198 |
+
" print(f\"{label}: {score:.3f}\")"
|
199 |
]
|
200 |
},
|
201 |
{
|
202 |
"cell_type": "markdown",
|
203 |
+
"metadata": {
|
204 |
+
"id": "qzbuleH_aA90"
|
205 |
+
},
|
206 |
"source": [
|
207 |
"###🧪 Test 2: Arabic Input + English Labels\n",
|
208 |
"Same sentence as above, but now labels are in English. Checking how this affects accuracy and confidence.\n"
|
209 |
+
]
|
210 |
},
|
211 |
{
|
212 |
"cell_type": "code",
|
213 |
+
"execution_count": null,
|
214 |
"metadata": {
|
215 |
"colab": {
|
216 |
"base_uri": "https://localhost:8080/"
|
|
|
218 |
"id": "omQ5DCJMD8zp",
|
219 |
"outputId": "2e1c2cac-393c-4579-ca03-dff1f97d8fa2"
|
220 |
},
|
|
|
221 |
"outputs": [
|
222 |
{
|
|
|
223 |
"data": {
|
224 |
"text/plain": [
|
225 |
"{'sequence': 'أنا أحب تعلم الذكاء الاصطناعي',\n",
|
|
|
227 |
" 'scores': [0.49976322054862976, 0.28726592659950256, 0.21297085285186768]}"
|
228 |
]
|
229 |
},
|
230 |
+
"execution_count": 94,
|
231 |
"metadata": {},
|
232 |
+
"output_type": "execute_result"
|
233 |
}
|
234 |
+
],
|
235 |
+
"source": [
|
236 |
+
"classifier(\n",
|
237 |
+
" \"أنا أحب تعلم الذكاء الاصطناعي\",\n",
|
238 |
+
" candidate_labels=[\"education\", \"sports\", \"politics\"]\n",
|
239 |
+
")"
|
240 |
]
|
241 |
},
|
242 |
{
|
243 |
"cell_type": "markdown",
|
244 |
"metadata": {
|
245 |
"id": "RSjDItt5a7t_"
|
246 |
+
},
|
247 |
+
"source": [
|
248 |
+
"This result is less accurate (lower confidence), but easier to interpret — no RTL formatting confusion."
|
249 |
+
]
|
250 |
},
|
251 |
{
|
252 |
"cell_type": "markdown",
|
253 |
+
"metadata": {
|
254 |
+
"id": "ajm4l1RAbEXc"
|
255 |
+
},
|
256 |
"source": [
|
257 |
"###🧪 Test 3: English Input and Labels = Most Accurate (as expected)\n",
|
258 |
"\n",
|
259 |
"When both the text and labels are in English, the model performs better:"
|
260 |
+
]
|
261 |
},
|
262 |
{
|
263 |
"cell_type": "code",
|
264 |
+
"execution_count": null,
|
265 |
"metadata": {
|
266 |
"colab": {
|
267 |
"base_uri": "https://localhost:8080/"
|
|
|
269 |
"id": "q4HHJjMHEY03",
|
270 |
"outputId": "0e47a0c2-ef5b-4f3c-c8ce-834efdf7405f"
|
271 |
},
|
|
|
272 |
"outputs": [
|
273 |
{
|
|
|
274 |
"data": {
|
275 |
"text/plain": [
|
276 |
"{'sequence': 'I love learning AI',\n",
|
|
|
278 |
" 'scores': [0.7564858198165894, 0.12874628603458405, 0.11476800590753555]}"
|
279 |
]
|
280 |
},
|
281 |
+
"execution_count": 95,
|
282 |
"metadata": {},
|
283 |
+
"output_type": "execute_result"
|
284 |
}
|
285 |
+
],
|
286 |
+
"source": [
|
287 |
+
"classifier(\n",
|
288 |
+
" \"I love learning AI\",\n",
|
289 |
+
" candidate_labels=[\"education\", \"sports\", \"food\"]\n",
|
290 |
+
")"
|
291 |
]
|
292 |
},
|
293 |
{
|
294 |
"cell_type": "markdown",
|
295 |
"metadata": {
|
296 |
"id": "e6uUPRXWbZFe"
|
297 |
+
},
|
298 |
+
"source": [
|
299 |
+
"###🧪 Test 4: Arabic Labels with English Input = Inaccurate & Low Confidence"
|
300 |
+
]
|
301 |
},
|
302 |
{
|
303 |
"cell_type": "code",
|
304 |
+
"execution_count": null,
|
305 |
"metadata": {
|
306 |
"colab": {
|
307 |
"base_uri": "https://localhost:8080/"
|
|
|
309 |
"id": "Bo3CpyhbEdbc",
|
310 |
"outputId": "60fe6502-acae-4e85-dd75-49ae02970f9f"
|
311 |
},
|
|
|
312 |
"outputs": [
|
313 |
{
|
|
|
314 |
"data": {
|
315 |
"text/plain": [
|
316 |
"{'sequence': 'I love learning AI',\n",
|
|
|
318 |
" 'scores': [0.37267985939979553, 0.33104342222213745, 0.296276718378067]}"
|
319 |
]
|
320 |
},
|
321 |
+
"execution_count": 96,
|
322 |
"metadata": {},
|
323 |
+
"output_type": "execute_result"
|
324 |
}
|
325 |
+
],
|
326 |
+
"source": [
|
327 |
+
"classifier(\n",
|
328 |
+
" \"I love learning AI\",\n",
|
329 |
+
" candidate_labels=[\"طعام\", \"تعليم\", \"رياضة\"]\n",
|
330 |
+
")"
|
331 |
]
|
332 |
},
|
333 |
{
|
334 |
"cell_type": "markdown",
|
335 |
"metadata": {
|
336 |
"id": "pS4oPUzybkH6"
|
337 |
+
},
|
338 |
+
"source": [
|
339 |
+
"The output is really confusing here because (تعليم) is in the middle that means the model didn't pick the correct word. To confirm that let's try the formatting output."
|
340 |
+
]
|
341 |
},
|
342 |
{
|
343 |
"cell_type": "code",
|
344 |
+
"execution_count": null,
|
345 |
"metadata": {
|
346 |
"colab": {
|
347 |
"base_uri": "https://localhost:8080/"
|
|
|
349 |
"id": "bGhzDDv4b6rD",
|
350 |
"outputId": "aa7f0788-075f-4491-de29-fe70a772e41b"
|
351 |
},
|
|
|
352 |
"outputs": [
|
353 |
{
|
|
|
354 |
"name": "stdout",
|
355 |
+
"output_type": "stream",
|
356 |
"text": [
|
357 |
"طعام: 0.373\n",
|
358 |
"تعليم: 0.331\n",
|
359 |
"رياضة: 0.296\n"
|
360 |
]
|
361 |
}
|
362 |
+
],
|
363 |
+
"source": [
|
364 |
+
"output = classifier(\"I love learning AI\", candidate_labels=[\"طعام\", \"تعليم\", \"رياضة\"])\n",
|
365 |
+
"for label, score in zip(output['labels'], output['scores']):\n",
|
366 |
+
" print(f\"{label}: {score:.3f}\")"
|
367 |
]
|
368 |
},
|
369 |
{
|
370 |
"cell_type": "markdown",
|
371 |
"metadata": {
|
372 |
"id": "XS3XnbRvcCLC"
|
373 |
+
},
|
374 |
+
"source": [
|
375 |
+
"It picked food (طعام) I dunno why, if I find out later I will update this."
|
376 |
+
]
|
377 |
},
|
378 |
{
|
379 |
"cell_type": "markdown",
|
380 |
+
"metadata": {
|
381 |
+
"id": "oAo5KT8zcdZw"
|
382 |
+
},
|
383 |
"source": [
|
384 |
"###🧪 Test 5: Mixed Labels with Arabic Input = A Funny Twist\n",
|
385 |
"\n",
|
386 |
"Using a mix of Arabic and English labels:"
|
387 |
+
]
|
388 |
},
|
389 |
{
|
390 |
"cell_type": "code",
|
391 |
+
"execution_count": null,
|
392 |
"metadata": {
|
393 |
"colab": {
|
394 |
"base_uri": "https://localhost:8080/"
|
|
|
396 |
"id": "pomecKh8FQCY",
|
397 |
"outputId": "ae18001c-da1d-4bba-edd9-982201fb06ad"
|
398 |
},
|
|
|
399 |
"outputs": [
|
400 |
{
|
|
|
401 |
"data": {
|
402 |
"text/plain": [
|
403 |
"{'sequence': 'أنا أحب تعلم الذكاء الاصطناعي',\n",
|
|
|
405 |
" 'scores': [0.7335377335548401, 0.16061054170131683, 0.10585174709558487]}"
|
406 |
]
|
407 |
},
|
408 |
+
"execution_count": 105,
|
409 |
"metadata": {},
|
410 |
+
"output_type": "execute_result"
|
411 |
}
|
412 |
+
],
|
413 |
+
"source": [
|
414 |
+
"classifier(\n",
|
415 |
+
" \"أنا أحب تعلم الذكاء الاصطناعي\",\n",
|
416 |
+
" candidate_labels=[\"education\", \"رياضة\", \"طعام\"]\n",
|
417 |
+
")"
|
418 |
]
|
419 |
},
|
420 |
{
|
421 |
"cell_type": "markdown",
|
422 |
"metadata": {
|
423 |
"id": "yKv2-idigfII"
|
424 |
+
},
|
425 |
+
"source": [
|
426 |
+
"The output resembles using Arabic labels; let's use formatting for clarification."
|
427 |
+
]
|
428 |
},
|
429 |
{
|
430 |
"cell_type": "code",
|
431 |
+
"execution_count": null,
|
432 |
"metadata": {
|
433 |
"colab": {
|
434 |
"base_uri": "https://localhost:8080/"
|
|
|
436 |
"id": "_8ez4mgkcwpR",
|
437 |
"outputId": "20721ea1-bd2e-4775-f591-713b2e3dda9b"
|
438 |
},
|
|
|
439 |
"outputs": [
|
440 |
{
|
|
|
441 |
"name": "stdout",
|
442 |
+
"output_type": "stream",
|
443 |
"text": [
|
444 |
"1. طعام: 0.734\n",
|
445 |
"2. رياضة: 0.161\n",
|
446 |
"3. education: 0.106\n"
|
447 |
]
|
448 |
}
|
449 |
+
],
|
450 |
+
"source": [
|
451 |
+
"output = classifier(\"أنا أحب تعلم الذكاء الاصطناعي\", candidate_labels=[\"education\", \"رياضة\", \"طعام\"])\n",
|
452 |
+
"sorted_results = sorted(zip(output['labels'], output['scores']), key=lambda x: x[1], reverse=True)\n",
|
453 |
+
"for i, (label, score) in enumerate(sorted_results, 1):\n",
|
454 |
+
" print(f\"{i}. {label}: {score:.3f}\")"
|
455 |
]
|
456 |
},
|
457 |
{
|
458 |
"cell_type": "markdown",
|
459 |
"metadata": {
|
460 |
"id": "MPYWjgNxhCEn"
|
461 |
+
},
|
462 |
+
"source": [
|
463 |
+
"I didn't expect that tbh 🧐. I'm sure the model got the correct result, but the results will always look confusing in this case specially if you format the output. Why? I'm not sure, but I guess education wasn't counted because it's not Arabic word, and it started counting from the next Arabic word (طعام). So our model knows the right answer but it doesn't know how to represent it in the correct way. I wonder how does it work with other languages 👀"
|
464 |
+
]
|
465 |
}
|
466 |
+
],
|
467 |
+
"metadata": {
|
468 |
+
"colab": {
|
469 |
+
"provenance": []
|
470 |
+
},
|
471 |
+
"kernelspec": {
|
472 |
+
"display_name": "Python 3",
|
473 |
+
"name": "python3"
|
474 |
+
},
|
475 |
+
"language_info": {
|
476 |
+
"name": "python"
|
477 |
+
}
|
478 |
+
},
|
479 |
+
"nbformat": 4,
|
480 |
+
"nbformat_minor": 0
|
481 |
+
}
|
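Note on reading the zero-shot outputs: the notebook's markdown cells worry that right-to-left Arabic labels make the printed `labels` list hard to line up with `scores`. Every recorded output lists the scores in descending order, which suggests the labels are already ranked to match. The sketch below is one possible way to print each label next to its score unambiguously. It is not part of this commit. `format_ranked` is a hypothetical helper, and the sample dict is copied from the Test 1 output so the snippet runs without downloading a model. It assumes the result has the `labels` and `scores` keys shown in the recorded outputs.

```python
# Hypothetical helper, not part of the notebook: prints one label/score
# pair per line and isolates each label so that right-to-left text cannot
# visually reorder the digits around it.
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE: force the whole line left-to-right
FSI = "\u2068"  # FIRST STRONG ISOLATE: let the label pick its own direction
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE: close the innermost isolate


def format_ranked(result):
    """Return 'rank. label: score' lines, highest score first."""
    pairs = sorted(
        zip(result["labels"], result["scores"]),
        key=lambda pair: pair[1],
        reverse=True,
    )
    lines = []
    for rank, (label, score) in enumerate(pairs, 1):
        lines.append(f"{LRI}{rank}. {FSI}{label}{PDI}: {score:.3f}{PDI}")
    return lines


# Sample copied from the Test 1 output recorded in the notebook.
sample = {
    "sequence": "أنا أحب تعلم الذكاء الاصطناعي",
    "labels": ["تعليم", "طعام", "رياضة"],
    "scores": [0.754145085811615, 0.20169343054294586, 0.04416144639253616],
}

for line in format_ranked(sample):
    print(line)
```

Pairing labels and scores before printing keeps the ranking explicit in the data itself, so the reading no longer depends on how a terminal or browser lays out mixed-direction text.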