docs: Day 3 notebook, polished log, final summary, and Day 4 vision added
- Refined Day 3 testing notes on summarization and translation
- Added a final summary highlighting current gaps in Arabic NLP
- Outlined Day 4 goals covering QA, NER, Fill-Mask, and text generation
- Continued emphasis on comparing default vs. fine-tuned models for Arabic
Progressing toward a full benchmark of Hugging Face pipelines for Arabic NLP 🇪🇬🇸🇦🚀
- README.md +2 -1
- logs/day2.md +1 -1
- logs/day3.md +183 -0
- notebooks/day3.ipynb +637 -0
README.md
CHANGED
@@ -24,7 +24,8 @@ To deepen my understanding of Gen AI, complete the Hugging Face course, build re
 |-----|-------|----------|-----|
 | 1 | First HF pipelines | [Colab](https://colab.research.google.com/drive/1ysW0sQq01mI9o5uVyaLMM5oCT3pDI41e?usp=sharing) / [Repo](notebooks/day1.ipynb) | [Day 1 Log](logs/day1.md) |
 | 2 | Beyond Default | [Colab](https://colab.research.google.com/drive/1h9AC5_Oe5eXtD0zkHdPo64aHG_9hapwD?usp=sharing) / [Repo](notebooks/day2.ipynb) | [Day 2 Log](logs/day2.md) |
-| 3 | ... coming soon... | - | - |
+| 3 | Summarization & Translation Deep Dive | [Colab](https://colab.research.google.com/drive/1CuD1NErkmrTRebbnXnFl5tLxG-A0NHnD?usp=sharing) / [Repo](notebooks/day3.ipynb) | [Day 3 Log](logs/day3.md) |
+| 4 | ... coming soon... | - | - |
 
 
 ## 🔧 Tech Stack
logs/day2.md
CHANGED
@@ -22,7 +22,7 @@ Previously, I found that the default `pipeline("sentiment-analysis")` worked oka
 - 💬 **Dialect-Friendly**: Correctly classified Egyptian slang like
   `"الواد سواق التوك توك جارنا عسل"` → **Positive 97%**
 
-- ⚠️ **Weakness**: Lower performance on English (
+- ⚠️ **Weakness**: Lower performance on English (61%) and French (50%)
 
 
 ### 🧠 Key Takeaways
logs/day3.md
ADDED
@@ -0,0 +1,183 @@
# ✍️ **Day 03 – Deep Dive into Summarization & Translation with Hugging Face 🤗**

Today marks our exploration of two advanced generative NLP tasks beyond classification: **Text Summarization** and **Machine Translation**. We’ll compare default Hugging Face pipelines to language-specific models, emphasizing Arabic.

---

## 📝 1. Text Summarization

### 1.1 Overview

- **Baseline:** Default `summarization` pipeline on English narrative.
- **English-focused models:** Compare `facebook/bart-large-cnn`, `csebuetnlp/mT5_multilingual_XLSum`, and `Falconsai/text_summarization` with length parameters.
- **Arabic narrative:** Assess multilingual vs. Arabic-specialized models on Arabic text.

### 1.2 Experiment 1: Default Pipeline on English Narrative

```python
from transformers import pipeline

summarizer = pipeline("summarization")

# A short excerpt of the story (the full text is in the notebook)
naruto_story = """
Born an orphan into the Hidden Leaf Village, Naruto Uzumaki's early life was shadowed by the terrifying Nine-Tailed Fox, a monstrous beast sealed within him."""

# Generate the summary
summary_default = summarizer(naruto_story)

# Print the result
print("--- Original Story ---")
print(naruto_story)
print("\n--- Default Summarizer Output ---")
print(summary_default[0]['summary_text'])
```

- **Model:** `sshleifer/distilbart-cnn-12-6` (default).
- **Input:** A Naruto Uzumaki story (with/without an initial title line).

**Key Observations:**

1. **Conciseness:** The summary distilled only the core arc (orphan → Hokage).
2. **Title Sensitivity:** With the title line present, the model labeled Naruto as “The Seventh Hokage” and omitted his name; removing the title restored “Naruto.”
3. **Omission of Details:** Side characters (Sasuke, Jiraiya, etc.) and subplots were dropped due to aggressive compression.

> **Insight:** Useful for quick overviews, but the default model lacks narrative richness and requires parameter tuning or fine-tuned models for detail retention.
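
To make the title-sensitivity point reproducible, here is a minimal sketch of the check behind Observation 2. It assumes `summarizer` and `naruto_story` from the snippet above are still in scope, and reuses the title line from the notebook experiment.

```python
# Summarize the same text with and without a leading title line
title_line = "Naruto Uzumaki: From Outcast to Hokage\n"

for label, text in [("With title line", title_line + naruto_story),
                    ("Without title line", naruto_story)]:
    summary = summarizer(text)[0]["summary_text"]
    print(f"--- {label} ---\n{summary}\n")
```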

### 1.3 Experiment 2: Fine-Tuned English Models

#### 1.3.1 facebook/bart-large-cnn

- **Pros:** More verbose; includes “Naruto Uzumaki”.
- **Cons:** Hallucination: misgendered Naruto as Kushina’s daughter.

#### 1.3.2 csebuetnlp/mT5_multilingual_XLSum

- **Issue:** Severe hallucinations; treated the narrative like news, fabricating details (e.g., a Konoha setting, a BBC reporter).

#### 1.3.3 Falconsai/text_summarization

```python
# Load the fine-tuned summarization model
summarizer = pipeline("summarization", model="Falconsai/text_summarization")

# Increase max_length to retain more detail; keep the result in its own
# variable so the pipeline object stays reusable
summary_falconsai = summarizer(naruto_story, max_length=562, min_length=100, do_sample=False)

print("\n--- Fine-Tuned Model on English Naruto Story ---")
print(summary_falconsai[0]['summary_text'])
```

- **Setup:** `max_length=562`, `min_length=100`, `do_sample=False`.
- **Performance:** Rich, coherent summary including multiple characters and plot points; minor truncation at the `max_length` cutoff.

> **Conclusion:** For English narrative, **Falconsai/text_summarization** offers the best balance of detail and accuracy.
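
For reference, a sketch of how the three candidates can be compared side by side. The model IDs are the checkpoints listed above; the length settings here are illustrative (the Falconsai run used `max_length=562`), and `naruto_story` is the text from Experiment 1.

```python
from transformers import pipeline

candidates = [
    "facebook/bart-large-cnn",
    "csebuetnlp/mT5_multilingual_XLSum",
    "Falconsai/text_summarization",
]

for model_id in candidates:
    summarizer = pipeline("summarization", model=model_id)
    # Illustrative length settings; tune per model
    result = summarizer(naruto_story, max_length=200, min_length=100, do_sample=False)
    print(f"\n--- {model_id} ---")
    print(result[0]["summary_text"])
```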

### 1.4 Experiment 3: Arabic Narrative Summarization

- **Model:** `csebuetnlp/mT5_multilingual_XLSum` (with `min_length=100`).

**Findings:**

1. Hallucinations persisted; the model invented BBC Arabic interview segments.
2. Other Arabic and multilingual models similarly fabricated content.
3. English-tuned models produced garbled output on Arabic input.

> **Conclusion:** Off-the-shelf Arabic summarization models on Hugging Face are currently unreliable, hallucinating freely on narrative text. Custom fine-tuning on Arabic narratives, or larger Arabic LLMs, may be required.
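
The Arabic run calls the model’s `generate` method directly instead of the pipeline (condensed from the Day 3 notebook). Here `luffy_story_arabic` stands for the Arabic narrative being summarized, and `min_length=100` is the setting that forced the longer, hallucination-prone outputs.

```python
import re
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "csebuetnlp/mT5_multilingual_XLSum"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Collapse newlines and extra whitespace before tokenizing
clean = lambda s: re.sub(r"\s+", " ", s.strip())

# `luffy_story_arabic` is the Arabic narrative defined in the notebook
input_ids = tokenizer(
    [clean(luffy_story_arabic)],
    return_tensors="pt",
    truncation=True,
    max_length=512,
)["input_ids"]

output_ids = model.generate(
    input_ids=input_ids,
    max_length=512,
    min_length=100,        # forcing a longer summary is where the fabrication appears
    no_repeat_ngram_size=2,
    num_beams=4,
)[0]

print(tokenizer.decode(output_ids, skip_special_tokens=True))
```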

---

## 🌐 2. Machine Translation Deep Dive

### 2.1 Scope

- **Focus:** Translate between English ↔ Modern Standard Arabic (MSA) and Arabic dialects.
- **Models Tested:**
  1. `facebook/nllb-200-distilled-600M`
  2. `Helsinki-NLP/opus-mt-ar-en`
  3. `Helsinki-NLP/opus-mt-en-ar`
  4. `Helsinki-NLP/opus-mt-mul-en`

### 2.2 Experiment Results

| Model | MSA ↔ EN | Dialectal AR → EN | Notes |
|---|---|---|---|
| nllb-200-distilled-600M | Strong, fluent | Partial transliteration (“Yasta I am tired”) | Requires explicit language codes. |
| opus-mt-ar-en | Good formal AR → EN | Struggled; literal or omitted slang | Tends toward brevity. |
| opus-mt-en-ar | Weak EN → AR | N/A | Incomplete outputs; unreliable. |
| opus-mt-mul-en | Good formal AR → EN | Poor on dialects | Multilingual training offers no advantage on dialects. |

> **Conclusion:** MSA translation is well-supported. Dialects remain a hurdle; **NLLB** shows promise through its recognition and transliteration of colloquialisms. Specialized fine-tuning or larger LLMs are needed for robust dialect handling.
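
The calls behind the table, condensed from the notebook: NLLB needs explicit source/target language codes, while the OPUS-MT checkpoint is a fixed AR→EN pair. The two test sentences are the MSA greeting and the Egyptian-dialect line referenced above.

```python
from transformers import pipeline

# NLLB requires explicit language codes (Arabic script in, Latin-script English out)
nllb = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="ara_Arab",
    tgt_lang="eng_Latn",
)
opus_ar_en = pipeline("translation", model="Helsinki-NLP/opus-mt-ar-en")

msa_sentence = "كيف حالك يا صديقي؟ أتمنى أن تكون بخير."  # MSA: "How are you, my friend? I hope you are well."
dialect_sentence = "ياسطا انا تعبان"  # Egyptian Arabic: "Hey man, I'm tired."

for name, translator in [("NLLB-200", nllb), ("OPUS-MT ar-en", opus_ar_en)]:
    print(f"--- {name} ---")
    print("MSA:    ", translator(msa_sentence)[0]["translation_text"])
    print("Dialect:", translator(dialect_sentence)[0]["translation_text"])
```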

---

### 🧠 Final Summary for Day 3

Today’s deep dive revealed both the capabilities and current limitations of open-source models when applied to Arabic-centric tasks:

📝 **Summarization:** English summaries are generally handled well, especially by models like `Falconsai/text_summarization`, which produced coherent and detailed outputs. Arabic summarization, however, continues to struggle with hallucinations and fragmented narratives, underscoring the need for Arabic-specific fine-tuning and better cultural grounding.

🌐 **Translation:** Modern Standard Arabic (MSA) is reasonably well-supported across several models. In contrast, Arabic dialects remain a major challenge, often yielding transliterations or contextually inaccurate translations. Among the tested models, `facebook/nllb-200-distilled-600M` showed the most potential, particularly when used with explicit language codes.

More broadly, these experiments highlight the ongoing hurdles posed by linguistic diversity, dialectal variation, and cultural nuance, even for advanced multilingual systems. This experience strengthens my motivation to keep learning and, ultimately, to contribute to building more inclusive tools for Arabic-speaking communities. 🌍💡

---

### 🔭 Vision for Day 4

Tomorrow’s mission is to wrap up all text-focused pipelines, completing the core set of foundational NLP tasks before shifting gears into vision models.

📌 **Pipelines to Explore** (a quick sketch of these calls follows the list):

1. **Question Answering**
   - Compare default vs. Arabic-optimized models
   - Test with both MSA and dialectal inputs
   - Evaluate performance on short vs. long contexts

2. **Named Entity Recognition (NER)**
   - Assess entity extraction accuracy in Arabic and English
   - Look for confusion or missed entities, especially with dialect-specific names or terms

3. **Fill-Mask**
   - Use models like `bert-base-multilingual-cased` and Arabic BERT variants
   - Observe predictions on varied inputs, including poetry, idioms, and slang

4. **Text Generation**
   - Experiment with `gpt2`, `mGPT`, and Arabic GPT models
   - Evaluate fluency, coherence, and hallucination tendencies
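
As a rough starting point for tomorrow, here is a sketch of the four pipeline calls planned above. The default English checkpoints serve as placeholders until the Arabic-optimized alternatives are chosen, and the example inputs are illustrative only.

```python
from transformers import pipeline

# 1. Question answering (default English model; Arabic alternative to be selected)
qa = pipeline("question-answering")
print(qa(question="Who became the Seventh Hokage?",
         context="Naruto Uzumaki became the Seventh Hokage of the Hidden Leaf Village."))

# 2. Named entity recognition with grouped entities
ner = pipeline("ner", aggregation_strategy="simple")
print(ner("Naruto Uzumaki lives in the Hidden Leaf Village."))

# 3. Fill-mask with a multilingual BERT (the mask token depends on the model)
fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")
print(fill_mask("Cairo is the capital of [MASK]."))

# 4. Text generation with GPT-2 as an English baseline
generator = pipeline("text-generation", model="gpt2")
print(generator("Arabic NLP still needs", max_new_tokens=20))
```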

---

🔁 **Goal:** Continue comparing default models with fine-tuned alternatives.
💡 **Mindset:** We’re not just running tests; we’re mapping the current landscape of Arabic in open-source NLP.
🎯 **Outcome:** By the end of Day 4, we’ll have a comprehensive understanding of Hugging Face’s strengths and gaps in multilingual text processing.
notebooks/day3.ipynb
ADDED
@@ -0,0 +1,637 @@
1 |
+
{
|
2 |
+
"nbformat": 4,
|
3 |
+
"nbformat_minor": 0,
|
4 |
+
"metadata": {
|
5 |
+
"colab": {
|
6 |
+
"provenance": []
|
7 |
+
},
|
8 |
+
"kernelspec": {
|
9 |
+
"name": "python3",
|
10 |
+
"display_name": "Python 3"
|
11 |
+
},
|
12 |
+
"language_info": {
|
13 |
+
"name": "python"
|
14 |
+
}
|
15 |
+
},
|
16 |
+
"cells": [
|
17 |
+
{
|
18 |
+
"cell_type": "markdown",
|
19 |
+
"source": [
|
20 |
+
"# ✍️ **Day 03 – Summarization & Translation Deep Dive with Hugging Face 🤗**\n",
|
21 |
+
"\n",
|
22 |
+
"This notebook contains all the code experiments for **Day 3** of my *30 Days of GenAI* challenge.\n",
|
23 |
+
"\n",
|
24 |
+
"For detailed commentary and discoveries, see 👉 [Day 3 Log](https://huggingface.co/Musno/30-days-of-genai/blob/main/logs/day3.md)\n",
|
25 |
+
"\n",
|
26 |
+
"---\n",
|
27 |
+
"\n",
|
28 |
+
"## 📌 What’s Covered Today\n",
|
29 |
+
"\n",
|
30 |
+
"Today, we're broadening our horizons beyond classification to explore two powerful generative NLP tasks: **Text Summarization** and **Machine Translation**. Our focus will be on understanding the capabilities of default Hugging Face pipelines versus models specifically fine-tuned for Arabic.\n",
|
31 |
+
"\n",
|
32 |
+
"Here’s our game plan:\n",
|
33 |
+
"\n",
|
34 |
+
"### 📝 Text Summarization\n",
|
35 |
+
"- Initial exploration with the **default summarization pipeline** using English text to establish a baseline.\n",
|
36 |
+
"- Evaluating the performance of **Arabic-specific summarization models** against the default, both with English and Arabic inputs, to observe the impact of specialized training.\n",
|
37 |
+
"\n",
|
38 |
+
"### 🌐 Machine Translation\n",
|
39 |
+
"- Testing translation capabilities between **English and Modern Standard Arabic (MSA)**, examining both directions (EN -> MSA, MSA -> EN) using fine-tuned models. We anticipate strong performance here.\n",
|
40 |
+
"- Tackling the more challenging task of translating **Arabic Dialects to English and vice-versa**. This is where we expect to see significant differences and highlight the necessity of dialect-aware models.\n",
|
41 |
+
"\n",
|
42 |
+
"Let’s dive in and uncover the nuances of text generation! 🚀\n",
|
43 |
+
"\n",
|
44 |
+
"---"
|
45 |
+
],
|
46 |
+
"metadata": {
|
47 |
+
"id": "IoWElNuiSktA"
|
48 |
+
}
|
49 |
+
},
|
50 |
+
{
|
51 |
+
"cell_type": "code",
|
52 |
+
"execution_count": 2,
|
53 |
+
"metadata": {
|
54 |
+
"id": "2Dnba_uXPlH8"
|
55 |
+
},
|
56 |
+
"outputs": [],
|
57 |
+
"source": [
|
58 |
+
"from transformers import pipeline"
|
59 |
+
]
|
60 |
+
},
|
61 |
+
{
|
62 |
+
"cell_type": "markdown",
|
63 |
+
"source": [
|
64 |
+
"### 📝 Summarization Experiment 1: Default Pipeline with Narrative Text (English)\n",
|
65 |
+
"\n",
|
66 |
+
"---\n",
|
67 |
+
"\n",
|
68 |
+
"For our first exploration into text summarization, we'll use the default Hugging Face `summarization` pipeline without specifying a particular model or length parameters. This will give us a baseline understanding of how a general-purpose model handles narrative content, specifically a story about a well-known fictional character like Naruto Uzumaki. We want to see how much detail it retains and its overall summarization style.\n",
|
69 |
+
"\n",
|
70 |
+
"---"
|
71 |
+
],
|
72 |
+
"metadata": {
|
73 |
+
"id": "FDunD64VikZW"
|
74 |
+
}
|
75 |
+
},
|
76 |
+
{
|
77 |
+
"cell_type": "code",
|
78 |
+
"source": [
|
79 |
+
"summarizer = pipeline(\"summarization\")\n",
|
80 |
+
"\n",
|
81 |
+
"\n",
|
82 |
+
"# Access the model's configuration\n",
|
83 |
+
"# The '_name_or_path' attribute often holds the model ID\n",
|
84 |
+
"# print(f\"The default summarization model loaded is: {summarizer.model.config._name_or_path}\")\n",
|
85 |
+
"\n",
|
86 |
+
"# You can also get more details about the model\n",
|
87 |
+
"# print(summarizer.model.config)\n",
|
88 |
+
"\n",
|
89 |
+
"\n",
|
90 |
+
"# The long story about Naruto\n",
|
91 |
+
"naruto_story = \"\"\"\n",
|
92 |
+
"Born an orphan into the Hidden Leaf Village, Naruto Uzumaki's early life was shadowed by the terrifying Nine-Tailed Fox, a monstrous beast sealed within him. This secret led to him being ostracized and feared by the villagers, forcing a young Naruto to desperately seek attention and validation through pranks and a boisterous personality. His unwavering dream, however, was to become the Hokage, the village leader, a path he believed would finally earn him the respect and love he craved.\n",
|
93 |
+
"\n",
|
94 |
+
"His journey began with humble ninja training, forming Team 7 with the aloof Sasuke Uchiha, his rival and eventual best friend, and the intelligent Sakura Haruno, under the guidance of the enigmatic Kakashi Hatake. Early missions, like confronting Zabuza and Haku in the Land of Waves, forged bonds and revealed Naruto's hidden potential and fierce loyalty. As he grew, he faced numerous personal and global conflicts. The heart-wrenching pursuit of Sasuke, driven by revenge and Orochimaru's manipulation, became a central struggle, pushing Naruto to immense power, including mastering the Rasengan under the tutelage of his beloved mentor, Jiraiya. Jiraiya's tragic death at the hands of Pain, a former student, was a profound blow, yet it fueled Naruto's resolve, leading him to confront Pain and bring peace to the devastated Konoha, finally earning the villagers' acknowledgment and admiration.\n",
|
95 |
+
"\n",
|
96 |
+
"The Fourth Great Ninja War tested Naruto's strength and conviction to their limits. During this cataclysmic conflict, he confronted the harsh truths of his heritage, had a heart-touching conversation with his resurrected mother, Kushina, and fought alongside his father, Minato, the Fourth Hokage. His ultimate clash with Sasuke, a final, world-altering battle at the Valley of the End, brought their complex relationship to a poignant resolution. Through relentless effort, unwavering belief in his friends, and an extraordinary capacity for empathy that allowed him to change even the hearts of his enemies, Naruto eventually achieved his childhood dream. He became the Seventh Hokage, the revered protector and hero of Konohagakure, guiding a new generation and finally fulfilling his promise to himself and his village.\n",
|
97 |
+
"\"\"\"\n",
|
98 |
+
"\n",
|
99 |
+
"# Generate the summary\n",
|
100 |
+
"summary_default = summarizer(naruto_story)\n",
|
101 |
+
"\n",
|
102 |
+
"# Print the result\n",
|
103 |
+
"print(\"--- Original Story ---\")\n",
|
104 |
+
"print(naruto_story)\n",
|
105 |
+
"print(\"\\n--- Default Summarizer Output ---\")\n",
|
106 |
+
"print(summary_default[0]['summary_text'])"
|
107 |
+
],
|
108 |
+
"metadata": {
|
109 |
+
"colab": {
|
110 |
+
"base_uri": "https://localhost:8080/"
|
111 |
+
},
|
112 |
+
"id": "y9Qt8LvhUUIR",
|
113 |
+
"outputId": "50f7993d-24d1-440c-be66-6c97b79c7d12"
|
114 |
+
},
|
115 |
+
"execution_count": 14,
|
116 |
+
"outputs": [
|
117 |
+
{
|
118 |
+
"output_type": "stream",
|
119 |
+
"name": "stderr",
|
120 |
+
"text": [
|
121 |
+
"No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).\n",
|
122 |
+
"Using a pipeline without specifying a model name and revision in production is not recommended.\n",
|
123 |
+
"Device set to use cpu\n"
|
124 |
+
]
|
125 |
+
},
|
126 |
+
{
|
127 |
+
"output_type": "stream",
|
128 |
+
"name": "stdout",
|
129 |
+
"text": [
|
130 |
+
"The default summarization model loaded is: sshleifer/distilbart-cnn-12-6\n",
|
131 |
+
"--- Original Story ---\n",
|
132 |
+
"\n",
|
133 |
+
"Born an orphan into the Hidden Leaf Village, Naruto Uzumaki's early life was shadowed by the terrifying Nine-Tailed Fox, a monstrous beast sealed within him. This secret led to him being ostracized and feared by the villagers, forcing a young Naruto to desperately seek attention and validation through pranks and a boisterous personality. His unwavering dream, however, was to become the Hokage, the village leader, a path he believed would finally earn him the respect and love he craved.\n",
|
134 |
+
"\n",
|
135 |
+
"His journey began with humble ninja training, forming Team 7 with the aloof Sasuke Uchiha, his rival and eventual best friend, and the intelligent Sakura Haruno, under the guidance of the enigmatic Kakashi Hatake. Early missions, like confronting Zabuza and Haku in the Land of Waves, forged bonds and revealed Naruto's hidden potential and fierce loyalty. As he grew, he faced numerous personal and global conflicts. The heart-wrenching pursuit of Sasuke, driven by revenge and Orochimaru's manipulation, became a central struggle, pushing Naruto to immense power, including mastering the Rasengan under the tutelage of his beloved mentor, Jiraiya. Jiraiya's tragic death at the hands of Pain, a former student, was a profound blow, yet it fueled Naruto's resolve, leading him to confront Pain and bring peace to the devastated Konoha, finally earning the villagers' acknowledgment and admiration.\n",
|
136 |
+
"\n",
|
137 |
+
"The Fourth Great Ninja War tested Naruto's strength and conviction to their limits. During this cataclysmic conflict, he confronted the harsh truths of his heritage, had a heart-touching conversation with his resurrected mother, Kushina, and fought alongside his father, Minato, the Fourth Hokage. His ultimate clash with Sasuke, a final, world-altering battle at the Valley of the End, brought their complex relationship to a poignant resolution. Through relentless effort, unwavering belief in his friends, and an extraordinary capacity for empathy that allowed him to change even the hearts of his enemies, Naruto eventually achieved his childhood dream. He became the Seventh Hokage, the revered protector and hero of Konohagakure, guiding a new generation and finally fulfilling his promise to himself and his village.\n",
|
138 |
+
"\n",
|
139 |
+
"\n",
|
140 |
+
"--- Default Summarizer Output ---\n",
|
141 |
+
" Naruto Uzumaki was born an orphan into the Hidden Leaf Village . His early life was shadowed by the terrifying Nine-Tailed Fox, a monstrous beast sealed within him . His unwavering dream was to become the Hokage, the village leader, a path he believed would earn him the respect and love he craved .\n"
|
142 |
+
]
|
143 |
+
}
|
144 |
+
]
|
145 |
+
},
|
146 |
+
{
|
147 |
+
"cell_type": "markdown",
|
148 |
+
"source": [
|
149 |
+
"---\n",
|
150 |
+
"\n",
|
151 |
+
"### 💡 Observation 1: Default Summarization Performance\n",
|
152 |
+
"\n",
|
153 |
+
"The default summarization pipeline (which internally uses a model like `sshleifer/distilbart-cnn-12-6`) produced a very concise summary.\n",
|
154 |
+
"\n",
|
155 |
+
"**Key observations:**\n",
|
156 |
+
"\n",
|
157 |
+
"* **Extreme Conciseness:** The model aggressively condensed the input, focusing on the absolute core narrative: Naruto's origin as an orphan with the Nine-Tails, his dream of becoming Hokage, and his eventual achievement of that goal.\n",
|
158 |
+
"* **Sensitivity to Initial Text / Abstractive & Title-Oriented:** Interestingly, when the initial descriptive line \"Naruto Uzumaki: From Outcast to Hokage\" was included at the very beginning of the input, the summary referred to the protagonist as \"The Seventh Hokage\" and omitted his name \"Naruto\". However, upon removing this initial line, the model *did* use \"Naruto\" by name. This suggests that the model gives significant weight to prominently placed introductory phrases or titles, using them to synthesize the primary identity of the subject. It prioritizes the *outcome* or *role* (Hokage) as the most salient identifier when provided with such a strong initial clue, aiming for maximum information density in a highly compressed output.\n",
|
159 |
+
"* **Information Omission:** Crucially, many significant details and character names (like Sasuke, Jiraiya, Sakura, Kakashi, Pain, the Great Ninja War, his parents) were entirely omitted. This is a direct consequence of the model's design for highly compressed summaries and its internal understanding of what constitutes \"essential\" information. While accurate, it lacks the richness of the original narrative.\n",
|
160 |
+
"\n",
|
161 |
+
"This initial test provides a valuable baseline, showing the model's ability to grasp the main arc of a complex story even without explicit parameters. However, it also highlights the need to control output length and consider task-specific fine-tuned models for richer, more detailed summaries, and how even subtle input formatting can influence the summary's focus."
|
162 |
+
],
|
163 |
+
"metadata": {
|
164 |
+
"id": "fx89eKESjmef"
|
165 |
+
}
|
166 |
+
},
|
167 |
+
{
|
168 |
+
"cell_type": "markdown",
|
169 |
+
"source": [
|
170 |
+
"---\n",
|
171 |
+
"### 📝 Summarization Experiment 2: Fine-Tuned Model with Parameters (English)\n",
|
172 |
+
"\n",
|
173 |
+
"Following our baseline test with the default summarization pipeline, we now shift our focus to a model specifically fine-tuned for text summarization: `Falconsai/text_summarization`. This model has demonstrated a stronger ability to capture and retain more granular details from narrative content compared to the default, making it a promising candidate for our English story. We will also explicitly set `max_length` and `min_length` parameters to gain more control over the summary's output size, aiming for a richer, yet still concise, summary.\n",
|
174 |
+
"\n",
|
175 |
+
"---"
|
176 |
+
],
|
177 |
+
"metadata": {
|
178 |
+
"id": "nn3wUnWTrIhq"
|
179 |
+
}
|
180 |
+
},
|
181 |
+
{
|
182 |
+
"cell_type": "code",
|
183 |
+
"source": [
|
184 |
+
"# Load the fine-tuned summarization model\n",
|
185 |
+
"summarizer = pipeline(\"summarization\", model=\"Falconsai/text_summarization\")\n",
|
186 |
+
"\n",
|
187 |
+
"# Experiment with increased max_length to get more detail\n",
|
188 |
+
"summarizer = summarizer(naruto_story, max_length=562, min_length=100, do_sample=False)\n",
|
189 |
+
"\n",
|
190 |
+
"print(\"\\n--- Fine-Tuned model on English Naruto Story ---\")\n",
|
191 |
+
"print(summarizer[0]['summary_text'])"
|
192 |
+
],
|
193 |
+
"metadata": {
|
194 |
+
"colab": {
|
195 |
+
"base_uri": "https://localhost:8080/"
|
196 |
+
},
|
197 |
+
"id": "JRb-11_krbJD",
|
198 |
+
"outputId": "a84734ca-1bc1-438e-865b-e5e88795624f"
|
199 |
+
},
|
200 |
+
"execution_count": 41,
|
201 |
+
"outputs": [
|
202 |
+
{
|
203 |
+
"output_type": "stream",
|
204 |
+
"name": "stderr",
|
205 |
+
"text": [
|
206 |
+
"Device set to use cpu\n",
|
207 |
+
"Token indices sequence length is longer than the specified maximum sequence length for this model (562 > 512). Running this sequence through the model will result in indexing errors\n",
|
208 |
+
"Both `max_new_tokens` (=256) and `max_length`(=562) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)\n"
|
209 |
+
]
|
210 |
+
},
|
211 |
+
{
|
212 |
+
"output_type": "stream",
|
213 |
+
"name": "stdout",
|
214 |
+
"text": [
|
215 |
+
"\n",
|
216 |
+
"--- Fine-Tuned model on English Naruto Story ---\n",
|
217 |
+
"Naruto Uzumaki was born an orphan into the Hidden Leaf Village . He became the Hokage, the revered protector and hero of Konohagakure . Through humble ninja training, he formed Team 7 with Sasuke Uchiha, his rival and eventual best friend, and the intelligent Sakura Haruno, under the guidance of the enigmatic Kakashi Hatake . As he grew, his journey forged bonds and revealed his hidden potential and fierce\n"
|
218 |
+
]
|
219 |
+
}
|
220 |
+
]
|
221 |
+
},
|
222 |
+
{
|
223 |
+
"cell_type": "markdown",
|
224 |
+
"source": [
|
225 |
+
"---\n",
|
226 |
+
"\n",
|
227 |
+
"### 💡 Observation 2: Performance of Specific Summarization Models (English Narrative)\n",
|
228 |
+
"\n",
|
229 |
+
"This section details the comparative performance of various summarization models on our English Naruto story, building upon the baseline established by the default pipeline. We aimed to identify models that offer a better balance of conciseness and detail, and that accurately capture the narrative's essence.\n",
|
230 |
+
"\n",
|
231 |
+
"Here's what we observed from the models tested:\n",
|
232 |
+
"\n",
|
233 |
+
"* **`facebook/bart-large-cnn`:**\n",
|
234 |
+
" * This model, a larger version of the default `distilbart`, produced a more verbose and generally coherent summary than the default. It successfully incorporated the protagonist's name, \"Naruto Uzumaki,\" right from the start.\n",
|
235 |
+
" * **However, a critical issue emerged: the model exhibited a factual inaccuracy by stating Naruto was the \"daughter of Kushina.\"** This is a prime example of \"hallucination,\" where an abstractive summarization model generates plausible-sounding but factually incorrect information. While generally powerful, this specific misattribution highlights the challenge of ensuring complete factual faithfulness in generated text, especially with fictional narratives which might not align perfectly with its general news-based training.\n",
|
236 |
+
"\n",
|
237 |
+
"* **`csebuetnlp/mT5_multilingual_XLSum`:**\n",
|
238 |
+
" * Despite its multilingual capabilities, this model performed poorly on the English Naruto story. **The output was largely \"made up,\" fabricating details not present in the original text** (e.g., \"northern Japanese village of Konoha in July,\" \"BBC's Nicholas Barber\").\n",
|
239 |
+
" * This severe hallucination and contextual irrelevance likely stem from a **domain mismatch**. The `XLSum` dataset, on which this model is fine-tuned, is predominantly composed of news articles. Consequently, the model attempted to summarize our fictional narrative as if it were a news report, imposing structures and factual elements characteristic of news. This strongly reinforces the importance of selecting models whose training data aligns with the domain of your input text. For this reason, we decided not to proceed further with this model for English narrative summarization.\n",
|
240 |
+
"\n",
|
241 |
+
"* **`Falconsai/text_summarization`:**\n",
|
242 |
+
" * This model, when given sufficient `max_length` and `min_length` parameters (`max_length=562, min_length=100, do_sample=False`), provided a very strong and detailed summary. It effectively included multiple key characters (Sasuke Uchiha, Sakura Haruno, Kakashi Hatake) and plot points (Team 7 formation, the pursuit of Sasuke) that were largely omitted by the more concise default model.\n",
|
243 |
+
" * While the summary sometimes appeared \"incomplete\" at the very end (\"...and fierce\"), this was a direct result of hitting the `max_length` limit mid-sentence, a common behavior when forcing longer outputs. By adjusting `max_length` further, one could likely mitigate this.\n",
|
244 |
+
"\n",
|
245 |
+
"**Conclusion on English Summarization Models:**\n",
|
246 |
+
"\n",
|
247 |
+
"Based on these experiments for English narrative summarization:\n",
|
248 |
+
"\n",
|
249 |
+
"* The **default `sshleifer/distilbart-cnn-12-6`** proved to be reliable for concise summaries, albeit with less detail.\n",
|
250 |
+
"* **`Falconsai/text_summarization`** stands out as the best performer for generating more comprehensive and accurate summaries of narrative content, successfully incorporating a richer set of details and character names. Its ability to summarize story elements more effectively makes it our preferred choice for this specific task.\n",
|
251 |
+
"\n",
|
252 |
+
"It's important to acknowledge that the landscape of pre-trained models on Hugging Face is constantly evolving. There are always new and potentially better models being released. Our observations are based on the models tested on **July 13, 2025**, and future models or different parameter configurations might yield even superior results. However, for the scope of this deep dive, `Falconsai/text_summarization` provides the most compelling performance for English narrative summarization.\n",
|
253 |
+
"\n",
|
254 |
+
"---\n"
|
255 |
+
],
|
256 |
+
"metadata": {
|
257 |
+
"id": "_OEU5Tpzrxcb"
|
258 |
+
}
|
259 |
+
},
|
260 |
+
{
|
261 |
+
"cell_type": "markdown",
|
262 |
+
"source": [
|
263 |
+
"### 📝 Summarization Experiment 3: Fine-Tuned Model with Arabic Narrative (Luffy Story)\n",
|
264 |
+
"\n",
|
265 |
+
"Having evaluated English summarization, we now pivot to a crucial challenge: summarizing Arabic narrative text. This requires models specifically trained on Arabic data. We will test `csebuetnlp/mT5_multilingual_XLSum`, a widely used multilingual model. Our aim is to assess how well it handles Arabic content, retains key details from a fictional story about Monkey D. Luffy, and produces coherent summaries in Modern Standard Arabic. We will also observe its response to `max_length` and `min_length` parameters, as we suspect some models have an inherent bias towards brevity."
|
266 |
+
],
|
267 |
+
"metadata": {
|
268 |
+
"id": "tVoGjFEDGIAJ"
|
269 |
+
}
|
270 |
+
},
|
271 |
+
{
|
272 |
+
"cell_type": "code",
|
273 |
+
"source": [
|
274 |
+
"import re\n",
|
275 |
+
"from transformers import AutoTokenizer, AutoModelForSeq2SeqLM\n",
|
276 |
+
"\n",
|
277 |
+
"WHITESPACE_HANDLER = lambda k: re.sub('\\s+', ' ', re.sub('\\n+', ' ', k.strip()))\n",
|
278 |
+
"\n",
|
279 |
+
"\n",
|
280 |
+
"model_name = \"csebuetnlp/mT5_multilingual_XLSum\"\n",
|
281 |
+
"tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
|
282 |
+
"model = AutoModelForSeq2SeqLM.from_pretrained(model_name)\n",
|
283 |
+
"\n",
|
284 |
+
"input_ids = tokenizer(\n",
|
285 |
+
" [WHITESPACE_HANDLER(luffy_story_arabic)],\n",
|
286 |
+
" return_tensors=\"pt\",\n",
|
287 |
+
" padding=\"max_length\",\n",
|
288 |
+
" truncation=True,\n",
|
289 |
+
" max_length=512\n",
|
290 |
+
")[\"input_ids\"]\n",
|
291 |
+
"\n",
|
292 |
+
"output_ids = model.generate(\n",
|
293 |
+
" input_ids=input_ids,\n",
|
294 |
+
" max_length=512,\n",
|
295 |
+
" min_length=100,\n",
|
296 |
+
" no_repeat_ngram_size=2,\n",
|
297 |
+
" num_beams=4\n",
|
298 |
+
")[0]\n",
|
299 |
+
"\n",
|
300 |
+
"summary = tokenizer.decode(\n",
|
301 |
+
" output_ids,\n",
|
302 |
+
" skip_special_tokens=True,\n",
|
303 |
+
" clean_up_tokenization_spaces=False\n",
|
304 |
+
")\n",
|
305 |
+
"\n",
|
306 |
+
"print(summary)\n"
|
307 |
+
],
|
308 |
+
"metadata": {
|
309 |
+
"colab": {
|
310 |
+
"base_uri": "https://localhost:8080/"
|
311 |
+
},
|
312 |
+
"id": "TyuQ5o6d12jk",
|
313 |
+
"outputId": "9f1245e9-af6f-4693-fa48-a299b3254618"
|
314 |
+
},
|
315 |
+
"execution_count": 20,
|
316 |
+
"outputs": [
|
317 |
+
{
|
318 |
+
"output_type": "stream",
|
319 |
+
"name": "stderr",
|
320 |
+
"text": [
|
321 |
+
"/usr/local/lib/python3.11/dist-packages/transformers/convert_slow_tokenizer.py:564: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.\n",
|
322 |
+
" warnings.warn(\n"
|
323 |
+
]
|
324 |
+
},
|
325 |
+
{
|
326 |
+
"output_type": "stream",
|
327 |
+
"name": "stdout",
|
328 |
+
"text": [
|
329 |
+
"يمثل مونكي دي لوفي رمزًا للحرية والإصرار الذي أثر في قلوب الملايين حول العالم، لكنه لم يكن مجرد قرصان عادي بل \"رمزٌ لحرية وإصدار\". الصحفية نيكولاس باربر تلقي الضوء على تأثير هذا الرجل في عالم \"ون بيس\" الخيالي.. . BBC عربي يلتقي فيه الشاب المثير للجدل.\n"
|
330 |
+
]
|
331 |
+
}
|
332 |
+
]
|
333 |
+
},
|
334 |
+
{
|
335 |
+
"cell_type": "markdown",
|
336 |
+
"source": [
|
337 |
+
"\n",
|
338 |
+
"---\n",
|
339 |
+
"\n",
|
340 |
+
"### 💡 Observation 3: Arabic Summarization Challenges\n",
|
341 |
+
"\n",
|
342 |
+
"Our journey into Arabic summarization has revealed significant challenges and underscored the importance of model selection, especially when dealing with specific language nuances and content domains. We further investigated the behavior of `csebuetnlp/mT5_multilingual_XLSum` by directly using its `generate` method and explicitly setting `min_length` to encourage a more detailed summary, aiming to overcome its previous brevity.\n",
|
343 |
+
"\n",
|
344 |
+
"Here are the key findings from our Arabic summarization tests:\n",
|
345 |
+
"\n",
|
346 |
+
"* **`csebuetnlp/mT5_multilingual_XLSum` (Re-tested with `min_length`):**\n",
|
347 |
+
" * When forced to generate a longer summary by setting `min_length=100` (within `model.generate()`), this model unfortunately also exhibited **hallucination issues**, similar to the other discarded models. It introduced fabricated details such as \"الصحفية نيكولاس باربر تلقي الضوء على تأثير هذا الرجل في عالم 'ون بيس' الخيالي.. . BBC عربي يلتقي فيه الشاب المثير للجدل.\" (Journalist Nicholas Barber sheds light on this man's impact in the fictional world of 'One Piece'... BBC Arabic meets the controversial young man).\n",
|
348 |
+
" * This clearly demonstrates that its propensity for hallucination is not simply due to brevity, but rather an inherent characteristic when applying it to narrative text, likely stemming from its training on news-focused datasets (XLSum). When pushed for more content, it defaults to generating information aligned with its primary domain.\n",
|
349 |
+
"\n",
|
350 |
+
"* **`eslamxm/mt5-base-finetuned-persian-finetuned-persian-arabic` & `ahmeddbahaa/mT5_multilingual_XLSum-finetuned-fa-finetuned-ar` (Models Previously Discarded):**\n",
|
351 |
+
" * As noted earlier, both of these models also consistently exhibited severe **hallucination issues** by generating factually incorrect or fabricated details (e.g., mentioning specific dates or external figures not present in the original story). This behavior reinforces their unsuitability for tasks demanding factual accuracy, especially outside their likely news-centric training domains.\n",
|
352 |
+
"\n",
|
353 |
+
"* **Default and English Models on Arabic Text:**\n",
|
354 |
+
" * As expected, attempting to use the default `sshleifer/distilbart-cnn-12-6` or `Falconsai/text_summarization` (which are fine-tuned primarily for English) on Arabic text resulted in uninterpretable or garbled outputs, confirming their lack of multilingual capability for Arabic.\n",
|
355 |
+
"\n",
|
356 |
+
"**Conclusion on Arabic Summarization Models (Current Date: July 13, 2025):**\n",
|
357 |
+
"\n",
|
358 |
+
"Our comprehensive testing of several \"Arabic-supporting\" summarization models reveals a significant challenge: finding a robust, off-the-shelf model capable of performing accurate and detailed **abstractive summarization of Arabic narrative text** without hallucination. All models tested that produced more than a single-sentence summary eventually resorted to generating fabricated information.\n",
|
359 |
+
"\n",
|
360 |
+
"This strongly suggests that for nuanced Arabic narrative summarization, relying solely on publicly available pre-trained models from the Hub *at this time* may lead to unreliable results, particularly when seeking detailed and factually faithful summaries. This could be a critical area where custom fine-tuning on a relevant Arabic narrative dataset might be necessary, or where larger, more general Arabic LLMs (used with careful prompting) might offer a solution if they become more accessible for fine-grained control. For the purpose of this deep dive, it highlights a current limitation in the readily available tooling for this specific task.\n",
|
361 |
+
"\n",
|
362 |
+
"---\n"
|
363 |
+
],
|
364 |
+
"metadata": {
|
365 |
+
"id": "7W-omkXY7WAA"
|
366 |
+
}
|
367 |
+
},
|
368 |
+
{
|
369 |
+
"cell_type": "markdown",
|
370 |
+
"source": [
|
371 |
+
"## 🌐 Machine Translation Deep Dive\n",
|
372 |
+
"\n",
|
373 |
+
"Having explored summarization, we now pivot to **Machine Translation (MT)**, a cornerstone of multilingual NLP. Our goal is to assess the capabilities of various pre-trained models on the Hugging Face Hub, focusing on English-to-Arabic and Arabic-to-English translation. A particular emphasis will be placed on understanding how these models handle both **Modern Standard Arabic (MSA - العربية الفصحى)** and the more challenging **Arabic dialects**, along with the common pitfall of \"Franco\" Arabic (romanized Arabic).\n",
|
374 |
+
"\n",
|
375 |
+
"We will directly test four prominent models, showcasing their output for both formal and dialectal sentences to highlight their respective strengths and limitations. This direct comparison will provide valuable insights into the current state of Arabic machine translation.\n",
|
376 |
+
"\n",
|
377 |
+
"---\n",
|
378 |
+
"\n",
|
379 |
+
"### Translation Experiment 1: `facebook/nllb-200-distilled-600M`\n",
|
380 |
+
"\n",
|
381 |
+
"This model is part of Meta AI's No Language Left Behind (NLLB) project, designed to provide high-quality translation for 200 languages. It's known for its broad coverage, including support for various Arabic dialects. We'll test its ability to translate both formal and dialectal Arabic to English, paying close attention to its handling of colloquialisms and informal text."
|
382 |
+
],
|
383 |
+
"metadata": {
|
384 |
+
"id": "7_byn_r1BEnc"
|
385 |
+
}
|
386 |
+
},
|
387 |
+
{
|
388 |
+
"cell_type": "code",
|
389 |
+
"source": [
|
390 |
+
"# Code for facebook/nllb-200-distilled-600M will go here\n",
|
391 |
+
"from transformers import pipeline\n",
|
392 |
+
"\n",
|
393 |
+
"# Example of how to use NLLB with specific language codes\n",
|
394 |
+
"# For Arabic (MSA) to English\n",
|
395 |
+
"translator_nllb_ara_en = pipeline(\"translation\", model=\"facebook/nllb-200-distilled-600M\", src_lang=\"ara_Arab\", tgt_lang=\"eng_Latn\")\n",
|
396 |
+
"print(\"--- NLLB (MSA Arabic to English) ---\")\n",
|
397 |
+
"print(translator_nllb_ara_en(\"كيف حالك يا صديقي؟ أتمنى أن تكون بخير.\"))\n",
|
398 |
+
"\n",
|
399 |
+
"# For Egyptian Arabic to English\n",
|
400 |
+
"print(\"\\n--- NLLB (Egyptian Arabic to English) ---\")\n",
|
401 |
+
"print(translator_nllb_ara_en(\"ياسطا انا تعبان\"))\n",
|
402 |
+
"print(translator_nllb_ara_en(\"هو انت عبيط ياسطا؟\"))"
|
403 |
+
],
|
404 |
+
"metadata": {
|
405 |
+
"colab": {
|
406 |
+
"base_uri": "https://localhost:8080/"
|
407 |
+
},
|
408 |
+
"id": "VAxCarDNB2Bn",
|
409 |
+
"outputId": "f16c563a-877a-4f5b-b487-c414656df31d"
|
410 |
+
},
|
411 |
+
"execution_count": 90,
|
412 |
+
"outputs": [
|
413 |
+
{
|
414 |
+
"output_type": "stream",
|
415 |
+
"name": "stderr",
|
416 |
+
"text": [
|
417 |
+
"Device set to use cpu\n"
|
418 |
+
]
|
419 |
+
},
|
420 |
+
{
|
421 |
+
"output_type": "stream",
|
422 |
+
"name": "stdout",
|
423 |
+
"text": [
|
424 |
+
"--- NLLB (MSA Arabic to English) ---\n",
|
425 |
+
"[{'translation_text': 'How are you, my friend?'}]\n",
|
426 |
+
"\n",
|
427 |
+
"--- NLLB (Egyptian Arabic to English) ---\n",
|
428 |
+
"[{'translation_text': \"Yasta, I'm tired of it.\"}]\n",
|
429 |
+
"[{'translation_text': 'Are you an abject Yasta?'}]\n"
|
430 |
+
]
|
431 |
+
}
|
432 |
+
]
|
433 |
+
},
|
434 |
+
{
|
435 |
+
"cell_type": "markdown",
|
436 |
+
"source": [
|
437 |
+
"---\n",
|
438 |
+
"\n",
|
439 |
+
"### Translation Experiment 2: `Helsinki-NLP/opus-mt-ar-en`\n",
|
440 |
+
"\n",
|
441 |
+
"This model is part of the OPUS-MT project, renowned for providing pre-trained models for a vast array of language pairs, often trained on parallel corpora from the OPUS project. This specific model is fine-tuned for Arabic-to-English translation. We will examine its performance on both formal and dialectal Arabic inputs, observing its fluency and accuracy, especially in contrast to NLLB's dialect handling.\n",
|
442 |
+
"\n",
|
443 |
+
"---"
|
444 |
+
],
|
445 |
+
"metadata": {
|
446 |
+
"id": "g_dXPawgetut"
|
447 |
+
}
|
448 |
+
},
|
449 |
+
{
|
450 |
+
"cell_type": "code",
|
451 |
+
"source": [
|
452 |
+
"# Code for Helsinki-NLP/opus-mt-ar-en will go here\n",
|
453 |
+
"\n",
|
454 |
+
"translator_opus_ar_en = pipeline(\"translation\", model=\"Helsinki-NLP/opus-mt-ar-en\")\n",
|
455 |
+
"\n",
|
456 |
+
"print(\"--- OPUS-MT (Arabic to English) ---\")\n",
|
457 |
+
"print(translator_opus_ar_en(\"كيف حالك يا صديقي؟ أتمنى أن تكون بخير.\"))\n",
|
458 |
+
"print(translator_opus_ar_en(\"ياسطا انا تعبان\"))\n",
|
459 |
+
"print(translator_opus_ar_en(\"هو انت عبيط ياسطا؟\"))"
|
460 |
+
],
|
461 |
+
"metadata": {
|
462 |
+
"colab": {
|
463 |
+
"base_uri": "https://localhost:8080/"
|
464 |
+
},
|
465 |
+
"id": "hdUfD7eDLc1W",
|
466 |
+
"outputId": "d745ac68-f5de-455e-e83e-6519667f3439"
|
467 |
+
},
|
468 |
+
"execution_count": 89,
|
469 |
+
"outputs": [
|
470 |
+
{
|
471 |
+
"output_type": "stream",
|
472 |
+
"name": "stderr",
|
473 |
+
"text": [
|
474 |
+
"Device set to use cpu\n"
|
475 |
+
]
|
476 |
+
},
|
477 |
+
{
|
478 |
+
"output_type": "stream",
|
479 |
+
"name": "stdout",
|
480 |
+
"text": [
|
481 |
+
"--- OPUS-MT (Arabic to English) ---\n",
|
482 |
+
"[{'translation_text': 'How you doing, buddy?'}]\n",
|
483 |
+
"[{'translation_text': \"I'm tired.\"}]\n",
|
484 |
+
"[{'translation_text': \"You're a jackass, aren't you?\"}]\n"
|
485 |
+
]
|
486 |
+
}
|
487 |
+
]
|
488 |
+
},
|
489 |
+
{
|
490 |
+
"cell_type": "markdown",
|
491 |
+
"source": [
|
492 |
+
"---\n",
|
493 |
+
"\n",
|
494 |
+
"### Translation Experiment 3: `Helsinki-NLP/opus-mt-en-ar`\n",
|
495 |
+
"\n",
|
496 |
+
"Complementing the previous OPUS-MT model, this one specializes in English-to-Arabic translation. We will test its capabilities for translating English sentences into Modern Standard Arabic, with a focus on its accuracy and completeness, noting any instances where it might struggle with specific sentence structures or nuances.\n",
|
497 |
+
"\n",
|
498 |
+
"----"
|
499 |
+
],
|
500 |
+
"metadata": {
|
501 |
+
"id": "jJ9EsH-NfwA3"
|
502 |
+
}
|
503 |
+
},
|
504 |
+
{
|
505 |
+
"cell_type": "code",
|
506 |
+
"source": [
|
507 |
+
"# Code for Helsinki-NLP/opus-mt-en-ar will go here\n",
|
508 |
+
"\n",
|
509 |
+
"translator_opus_en_ar = pipeline(\"translation\", model=\"Helsinki-NLP/opus-mt-en-ar\")\n",
|
510 |
+
"\n",
|
511 |
+
"print(\"--- OPUS-MT (English to Arabic) ---\")\n",
|
512 |
+
"print(translator_opus_en_ar(\"How are you, my friend? I hope you're okay.\"))\n"
|
513 |
+
],
|
514 |
+
"metadata": {
|
515 |
+
"colab": {
|
516 |
+
"base_uri": "https://localhost:8080/"
|
517 |
+
},
|
518 |
+
"id": "qbEN3AyFYQB_",
|
519 |
+
"outputId": "af234f67-201b-4ed4-9d2a-e11802a5d876"
|
520 |
+
},
|
521 |
+
"execution_count": 91,
|
522 |
+
"outputs": [
|
523 |
+
{
|
524 |
+
"output_type": "stream",
|
525 |
+
"name": "stderr",
|
526 |
+
"text": [
|
527 |
+
"Device set to use cpu\n"
|
528 |
+
]
|
529 |
+
},
|
530 |
+
{
|
531 |
+
"output_type": "stream",
|
532 |
+
"name": "stdout",
|
533 |
+
"text": [
|
534 |
+
"--- OPUS-MT (English to Arabic) ---\n",
|
535 |
+
"[{'translation_text': '-آمل أنّك بخير .'}]\n"
|
536 |
+
]
|
537 |
+
}
|
538 |
+
]
|
539 |
+
},
|
540 |
+
{
|
541 |
+
"cell_type": "markdown",
|
542 |
+
"source": [
|
543 |
+
"---\n",
|
544 |
+
"\n",
|
545 |
+
"### Translation Experiment 4: `Helsinki-NLP/opus-mt-mul-en`\n",
|
546 |
+
"\n",
|
547 |
+
"This multilingual OPUS-MT model is designed to translate from various source languages (including Arabic) to English. We'll examine its general robustness and compare its performance, particularly on dialectal Arabic, against the dedicated `opus-mt-ar-en` and the NLLB model, to see if its broader multilingual training offers any advantages or different failure modes.\n",
|
548 |
+
"\n",
|
549 |
+
"---"
|
550 |
+
],
|
551 |
+
"metadata": {
|
552 |
+
"id": "ZBh43YehgBsA"
|
553 |
+
}
|
554 |
+
},
|
555 |
+
{
|
556 |
+
"cell_type": "code",
|
557 |
+
"source": [
|
558 |
+
"# Code for Helsinki-NLP/opus-mt-mul-en will go here\n",
|
559 |
+
"\n",
|
560 |
+
"# For multilingual to English, source language can sometimes be auto-detected or specified\n",
|
561 |
+
"# Here, we assume it can handle Arabic input.\n",
|
562 |
+
"translator_opus_mul_en = pipeline(\"translation\", model=\"Helsinki-NLP/opus-mt-mul-en\")\n",
|
563 |
+
"\n",
|
564 |
+
"print(\"--- OPUS-MT (Multilingual to English) ---\")\n",
|
565 |
+
"print(translator_opus_mul_en(\"كيف حالك يا صديقي؟ أتمنى أن تكون بخير.\"))\n",
|
566 |
+
"print(translator_opus_mul_en(\"ياسطا انا تعبان\"))\n"
|
567 |
+
],
|
568 |
+
"metadata": {
|
569 |
+
"colab": {
|
570 |
+
"base_uri": "https://localhost:8080/"
|
571 |
+
},
|
572 |
+
"id": "44YxtgVFMp2W",
|
573 |
+
"outputId": "91127ea1-da6f-4f78-b913-6e4421199e52"
|
574 |
+
},
|
575 |
+
"execution_count": 92,
|
576 |
+
"outputs": [
|
577 |
+
{
|
578 |
+
"output_type": "stream",
|
579 |
+
"name": "stderr",
|
580 |
+
"text": [
|
581 |
+
"Device set to use cpu\n"
|
582 |
+
]
|
583 |
+
},
|
584 |
+
{
|
585 |
+
"output_type": "stream",
|
586 |
+
"name": "stdout",
|
587 |
+
"text": [
|
588 |
+
"--- OPUS-MT (Multilingual to English) ---\n",
|
589 |
+
"[{'translation_text': \"How are you, buddy? I hope you're okay.\"}]\n",
|
590 |
+
"[{'translation_text': \"Oh, my God. I'm an asshole.\"}]\n"
|
591 |
+
]
|
592 |
+
}
|
593 |
+
]
|
594 |
+
},
|
595 |
+
{
|
596 |
+
"cell_type": "markdown",
|
597 |
+
"source": [
|
598 |
+
"---\n",
|
599 |
+
"\n",
|
600 |
+
"### 💡 Observation 4: Machine Translation Performance Across Formal and Dialectal Arabic\n",
|
601 |
+
"\n",
|
602 |
+
"Our exploration into Machine Translation revealed varying degrees of success across different models, particularly highlighting the persistent challenge of handling Arabic dialects compared to Modern Standard Arabic (MSA).\n",
|
603 |
+
"\n",
|
604 |
+
"Here's a summary of our findings for each model:\n",
|
605 |
+
"\n",
|
606 |
+
"* **`facebook/nllb-200-distilled-600M`**:\n",
|
607 |
+
" * **Formal Arabic (AR to EN & EN to AR)**: Performed well, providing accurate and fluent translations for Modern Standard Arabic sentences in both directions, especially when `src_lang` and `tgt_lang` were explicitly set with the correct NLLB language codes (e.g., `ara_Arab`, `eng_Latn`).\n",
|
608 |
+
" * **Dialectal Arabic (AR to EN)**: Showed a unique and interesting behavior. While it struggled with direct, fluent translations of complex dialectal sentences, it demonstrated an awareness of colloquial terms. For example, \"ياسطا انا تعبان\" (Yasta ana ta'ban - Hey man, I'm tired) was often transliterated as \"Yasta I am tired\" rather than a full English translation. This 'Franco' Arabic (Arabic words written with Latin characters) output, while not a perfect translation, indicates the model's exposure to and recognition of informal, real-world Arabic usage, which is a notable capability. When presented with more complex or highly dialectal phrases, it sometimes struggled to produce coherent translations.\n",
|
609 |
+
" * **Initial Quirk:** It initially showed a tendency to translate to English by default, even when parameters were set, suggesting that explicit language code usage is crucial for consistent behavior.\n",
|
610 |
+
"\n",
|
611 |
+
"* **`Helsinki-NLP/opus-mt-ar-en`**:\n",
|
612 |
+
" * **Formal Arabic (AR to EN)**: Generally good, producing intelligible translations. However, it sometimes exhibited conciseness, occasionally omitting parts of longer, grammatically correct sentences (e.g., shortening \"How are you, my friend? I hope you're okay.\" to \"How you doing, buddy?\"). This suggests a tendency towards brevity or a potential limitation in capturing full semantic content consistently.\n",
|
613 |
+
" * **Dialectal Arabic (AR to EN)**: Similar to many models, it struggled significantly with dialectal phrases. While it attempted translations that made some sense (e.g., \"You're a jackass, aren't you?\" for \"هو انت عبيط ياسطا؟\"), it often failed to accurately capture or fully translate highly colloquial words or slang, often opting for more generalized or formal equivalents, if any.\n",
|
614 |
+
"\n",
|
615 |
+
"* **`Helsinki-NLP/opus-mt-en-ar`**:\n",
|
616 |
+
" * **Formal Arabic (EN to AR)**: This model showed a surprising and significant weakness in the English-to-Arabic direction. It notably failed to translate entire parts of formal English sentences (e.g., \"How are you, my friend? I hope you're okay.\" translated to only \"-آمل أنّك بخير .\"), rendering the output incomplete and grammatically incorrect. This makes it unreliable for robust EN-to-AR translation.\n",
|
617 |
+
"\n",
|
618 |
+
"* **`Helsinki-NLP/opus-mt-mul-en`**:\n",
|
619 |
+
" * **Formal Arabic (AR to EN)**: Handled formal Arabic to English correctly, indicating its general multilingual capability for standard languages.\n",
|
620 |
+
" * **Dialectal Arabic (AR to EN)**: Similar to other non-NLLB models, it largely failed on dialectal Arabic, producing translations that were often unrelated to the original input. Its broader multilingual training did not seem to equip it with a nuanced understanding of Arabic dialects.\n",
|
621 |
+
"\n",
|
622 |
+
"**Overall Conclusion on Machine Translation:**\n",
|
623 |
+
"\n",
|
624 |
+
"Our tests confirm that while **Modern Standard Arabic (العربية الفصحى) translation is reasonably well-supported by several models** (with `facebook/nllb-200-distilled-600M` and `Helsinki-NLP/opus-mt-ar-en` performing commendably in AR-to-EN, and NLLB being strong in EN-to-AR), **translating Arabic dialects remains a significant challenge for publicly available, general-purpose models.**\n",
|
625 |
+
"\n",
|
626 |
+
"The `facebook/nllb-200-distilled-600M` model, despite requiring precise language code specification, emerged as the most promising for its unique (though imperfect) ability to recognize and transliterate certain dialectal terms. This suggests NLLB's broader dataset encompasses more real-world, informal Arabic, setting it apart from the OPUS-MT models that tend to lean heavily on formal language.\n",
|
627 |
+
"\n",
|
628 |
+
"For highly accurate and nuanced dialectal Arabic translation, specialized fine-tuning on relevant dialectal datasets or the use of larger, more comprehensively trained multimodal LLMs might be necessary. However, within the confines of readily accessible pre-trained models on the Hugging Face Hub, NLLB stands out for its potential in this complex domain.\n",
|
629 |
+
"\n",
|
630 |
+
"---"
|
631 |
+
],
|
632 |
+
"metadata": {
|
633 |
+
"id": "Rr6OT8p9g1n6"
|
634 |
+
}
|
635 |
+
}
|
636 |
+
]
|
637 |
+
}
|