Musno committed
Commit 690ad61 · 1 Parent(s): 19301c9

📝 Day 4 – Multilingual NLP Pipelines (EN + AR) – QA, NER, Fill‑Mask, Generation


Comprehensive evaluation of four core NLP pipelines—Question Answering, Named Entity Recognition, Fill‑Mask, and Text Generation—across English, Modern Standard Arabic, and Egyptian dialect.

🔍 Highlights:

- Arabic QA with mdeberta-v3-base-squad2 shows exceptional dialectal sensitivity
- NER with CamelBERT achieves high accuracy on informal text
- Multilingual fill-mask surprises with strong dialect completions
- Arabic text generation tested with the large-scale LFM2-1.2B on Kaggle
- Each section includes observations, edge cases, and model behavior insights

🚀 Looking ahead: Day 5 will explore Image, Audio, and potentially multimodal pipelines.

Files changed (3)
  1. README.md +2 -1
  2. logs/day4.md +120 -0
  3. notebooks/day4.ipynb +0 -0
README.md CHANGED
@@ -25,7 +25,8 @@ To deepen my understanding of Gen AI, complete the Hugging Face course, build re
  | 1 | First HF pipelines | [Colab](https://colab.research.google.com/drive/1ysW0sQq01mI9o5uVyaLMM5oCT3pDI41e?usp=sharing) / [Repo](notebooks/day1.ipynb) | [Day 1 Log](logs/day1.md) |
  | 2 | Beyond Default | [Colab](https://colab.research.google.com/drive/1h9AC5_Oe5eXtD0zkHdPo64aHG_9hapwD?usp=sharing) / [Repo](notebooks/day2.ipynb) | [Day 2 Log](logs/day2.md) |
  | 3 | Summarization & Translation Deep Dive | [Colab](https://colab.research.google.com/drive/1CuD1NErkmrTRebbnXnFl5tLxG-A0NHnD?usp=sharing) / [Repo](notebooks/day3.ipynb) | [Day 3 Log](logs/day3.md) |
- | 4 | ... coming soon... | - | - |
+ | 4 | Expanding NLP Horizons | [Colab](https://colab.research.google.com/drive/1sFWhIznoMSd_RjoNk0c1bGNPUBgEnQz-?usp=sharing) / [Repo](notebooks/day4.ipynb) | [Day 4 Log](logs/day4.md) |
+ | 5 | ... coming soon... | - | - |


  ## 🔧 Tech Stack
logs/day4.md ADDED
@@ -0,0 +1,120 @@
## Day 4: Multilingual NLP Pipelines

Today’s notebook covers four core NLP tasks—**Question Answering**, **Named Entity Recognition**, **Fill‑Mask**, and **Text Generation**—evaluated in both **English** and **Arabic** (MSA & Egyptian dialect). For each pipeline, we compare general-purpose and language-specific models to assess real-world performance.

---

### 1. Question Answering (QA)

**Goal:** Extract precise answers from a context passage given a question, measuring accuracy, robustness to rephrasing, and dialect handling.

```python
from transformers import pipeline

qa_default = pipeline("question-answering")
qa_finetuned = pipeline(
    "question-answering", model="timpal0l/mdeberta-v3-base-squad2"
)
```
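
For reference, a QA pipeline is called with a `question` and a `context` and returns the highest-scoring extracted span. A minimal invocation sketch (the strings here are illustrative, not the notebook's actual inputs):

```python
# Illustrative inputs only; the notebook's actual question/context differ.
result = qa_finetuned(
    question="What pseudonym did the author write under?",
    context=(
        "Samuel Clemens, better known by the pseudonym Mark Twain, "
        "wrote The Adventures of Tom Sawyer."
    ),
)
# The pipeline returns a dict with the extracted span and its confidence:
# {'score': ..., 'start': ..., 'end': ..., 'answer': 'Mark Twain'}
print(result["answer"], result["score"])
```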

#### Observations

- **English:** Both models answered direct queries (e.g., the author’s pseudonym) accurately. For inference questions, the default model returned an action span, while mDeBERTa captured the implied effect more closely, though both remained extractive.

- **Arabic (MSA & Dialect):** The fine-tuned model excelled, responding correctly to subtle rephrasings in MSA and demonstrating orthographic sensitivity and accurate extraction in informal Egyptian contexts.

**Conclusion:** `mdeberta-v3-base-squad2` is robust across languages and dialects, albeit limited by extractive span selection rather than deep reasoning.

---

### 2. Named Entity Recognition (NER)

**Goal:** Identify `PERSON`, `ORG`, `LOC`, `DATE`, etc., evaluating general vs. specialized models.

```python
from transformers import pipeline

ner_default = pipeline("ner")
ner_multi = pipeline(
    "ner", model="Davlan/bert-base-multilingual-cased-ner-hrl"
)
ner_arabic = pipeline(
    "ner", model="CAMeL-Lab/bert-base-arabic-camelbert-mix-ner"
)
```
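
A minimal usage sketch (the sentence is illustrative): passing `aggregation_strategy="simple"` makes the pipeline merge subword tokens into whole entity spans, which is relevant to the boundary issues noted in the observations below.

```python
from transformers import pipeline

# Illustrative example; aggregation_strategy="simple" groups subword tokens
# into complete entity spans instead of returning raw wordpiece tags.
ner_grouped = pipeline(
    "ner",
    model="Davlan/bert-base-multilingual-cased-ner-hrl",
    aggregation_strategy="simple",
)
for ent in ner_grouped("Ahmed works at the United Nations office in Cairo."):
    print(ent["entity_group"], ent["word"], round(ent["score"], 3))
```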

#### Observations

- **English:** Both models detected common entities but missed dates and monetary values, likely because their CoNLL-style tag sets do not include `DATE` or `MONEY` labels. The default pipeline also mishandled long entity spans due to subword tokenization boundaries (see the grouped-entity sketch above).

- **Arabic (MSA):** CamelBERT accurately tagged organizations and locations, with improvements when context prepositions were explicit.

- **Arabic (Dialect):** CamelBERT excelled on colloquial text, correctly extracting names and places from highly informal sentences.

**Takeaway:** Language-specific NER models (e.g., CamelBERT) outperform general pipelines, especially for morphologically rich or dialectal text.

---

### 3. Fill‑Mask (Masked LM)

**Goal:** Predict missing tokens in a sequence, focusing on literal vs. idiomatic and dialectal completions.

```python
from transformers import pipeline

fm_en = pipeline("fill-mask", model="bert-base-uncased")
fm_multi = pipeline("fill-mask", model="xlm-roberta-base")
```
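
Worth noting: the two models use different mask tokens (`[MASK]` for BERT, `<mask>` for XLM-RoBERTa), so reading the token off each tokenizer avoids a mismatch. A small sketch with an illustrative sentence:

```python
# Each model defines its own mask token ([MASK] vs. <mask>), so build the
# prompt from tokenizer.mask_token rather than hard-coding it.
for fm in (fm_en, fm_multi):
    prompt = f"The cat sat on the {fm.tokenizer.mask_token}."
    for pred in fm(prompt)[:3]:  # top-3 completions
        print(pred["token_str"], round(pred["score"], 3))
```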

#### Observations

- **English:** BERT variants generated sensible completions ("mat", "floor", etc.) for straightforward sentences.

- **Arabic (MSA):** Most models struggled with idiomatic proverbs (expected "صديق", "friend"), though XLM-R offered one correct suggestion among its top predictions.

- **Arabic (Dialect):** XLM-R accurately completed colloquial sentences ("مدينة", "city"), highlighting strong pattern learning for common informal structures.

**Key Insight:** Models excel at literal patterns but falter on fixed idioms. Dialect predictions benefit from frequent informal patterns in pretraining data.

---

### 4. Text Generation

**Goal:** Generate coherent, contextually relevant text from a prompt, evaluating fluency, creativity, and resource demands.

```python
from transformers import pipeline

text_gen = pipeline("text-generation", model="gpt2")
text_gen_ar = pipeline("text-generation", model="LiquidAI/LFM2-1.2B")
```
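
Since the observations below refer to sampling parameters, here is a sketch of a tuned call (prompt and values are illustrative, not the notebook's exact settings):

```python
# Illustrative settings; the notebook's exact prompt and values may differ.
out = text_gen(
    "Once upon a time in Cairo,",
    max_new_tokens=50,
    do_sample=True,          # sample instead of greedy decoding
    temperature=0.8,         # <1.0 sharpens the token distribution
    top_k=50,                # restrict sampling to the 50 likeliest tokens
    repetition_penalty=1.2,  # discourage verbatim loops
)
print(out[0]["generated_text"])
```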
90
+
91
+ #### Observations
92
+
93
+ - **English:** GPT-2 and mGPT produce coherent short passages but require parameter tuning (`temperature`, `top_k`) to mitigate grammatical issues and repetition.
94
+
95
+ - **Arabic:** The 1.2B-parameter LFM2 model generated fluent, coherent Arabic but demanded substantial GPU memory, necessitating high‑capacity environments.
96
+
97
+
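
One standard way to tame the memory footprint (an assumption on my part, not necessarily what the notebook did) is half-precision loading plus automatic device placement via `accelerate`:

```python
import torch
from transformers import pipeline

# Assumed memory-saving setup, not confirmed from the notebook: bf16 weights
# roughly halve memory vs. float32, and device_map="auto" lets accelerate
# place layers on the GPU and spill the rest to CPU (requires `accelerate`).
text_gen_ar = pipeline(
    "text-generation",
    model="LiquidAI/LFM2-1.2B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```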
98
+ **Conclusion:** While smaller models are accessible for English, high-quality Arabic generation relies on larger models and robust infrastructure.
99
+
100
+ ---
101
+
102
+ ## âś… Final Summary
103
+
104
+ - **Question Answering:** Extractive models perform well on explicit facts; fine-tuned mDeBERTa handles dialects and phrasing nuances best.
105
+
106
+ - **NER:** Specialized models like CamelBERT are essential for accurate entity extraction in Arabic and dialects.
107
+
108
+ - **Fill‑Mask:** Literal tasks succeed broadly; idiomatic understanding requires further fine-tuning on proverb datasets.
109
+
110
+ - **Text Generation:** English generation is lightweight; Arabic generation is resource-intensive but yields strong outputs with dedicated models.
111
+
112
+
113
+
114
+ > **Key Insight:** Multilingual pipelines offer powerful baselines, but **task-specific and language-specific fine-tuning** is crucial for morphologically complex and dialect-rich languages.
115
+
116
+ ---
117
+
118
+ ## đź”­ Vision for Day 5
119
+
120
+ Tomorrow, we will expand our exploration to **Image and Audio** pipelines—such as object detection, image captioning, and automatic speech recognition—and potentially venture into **multimodal** and **graph-based** NLP tasks as resources permit.
notebooks/day4.ipynb ADDED
The diff for this file is too large to render.