Musno committed
Commit 690ad61 · 1 Parent(s): 19301c9

📝 Day 4 – Multilingual NLP Pipelines (EN + AR) – QA, NER, Fill‑Mask, Generation


Comprehensive evaluation of four core NLP pipelines—Question Answering, Named Entity Recognition, Fill‑Mask, and Text Generation—across English, Modern Standard Arabic, and Egyptian dialect.

🔍 Highlights:

- Arabic QA with mdeberta-v3-base-squad2 shows exceptional dialectal sensitivity
- NER with CamelBERT achieves high accuracy on informal text
- Multilingual fill-mask surprises with strong dialect completions
- Arabic text generation tested with the large-scale LFM2-1.2B on Kaggle
- Each section includes observations, edge cases, and model behavior insights

🚀 Looking ahead: Day 5 will explore Image, Audio, and potentially multimodal pipelines.

Files changed (3)
  1. README.md +2 -1
  2. logs/day4.md +120 -0
  3. notebooks/day4.ipynb +0 -0
README.md CHANGED
@@ -25,7 +25,8 @@ To deepen my understanding of Gen AI, complete the Hugging Face course, build re
  | 1 | First HF pipelines | [Colab](https://colab.research.google.com/drive/1ysW0sQq01mI9o5uVyaLMM5oCT3pDI41e?usp=sharing) / [Repo](notebooks/day1.ipynb) | [Day 1 Log](logs/day1.md) |
  | 2 | Beyond Default | [Colab](https://colab.research.google.com/drive/1h9AC5_Oe5eXtD0zkHdPo64aHG_9hapwD?usp=sharing) / [Repo](notebooks/day2.ipynb) | [Day 2 Log](logs/day2.md) |
  | 3 | Summarization & Translation Deep Dive | [Colab](https://colab.research.google.com/drive/1CuD1NErkmrTRebbnXnFl5tLxG-A0NHnD?usp=sharing) / [Repo](notebooks/day3.ipynb) | [Day 3 Log](logs/day3.md) |
- | 4 | ... coming soon... | - | - |
+ | 4 | Expanding NLP Horizons | [Colab](https://colab.research.google.com/drive/1sFWhIznoMSd_RjoNk0c1bGNPUBgEnQz-?usp=sharing) / [Repo](notebooks/day4.ipynb) | [Day 4 Log](logs/day4.md) |
+ | 5 | ... coming soon... | - | - |


  ## 🔧 Tech Stack
logs/day4.md ADDED
@@ -0,0 +1,120 @@
## Day 4: Multilingual NLP Pipelines

Today’s notebook covers four core NLP tasks—**Question Answering**, **Named Entity Recognition**, **Fill‑Mask**, and **Text Generation**—evaluated in both **English** and **Arabic** (MSA & Egyptian dialect). For each pipeline, we compare general-purpose and language-specific models to assess real-world performance.

---

### 1. Question Answering (QA)

**Goal:** Extract precise answers from a context passage given a question, measuring accuracy, robustness to rephrasing, and dialect handling.

```python
from transformers import pipeline

qa_default = pipeline("question-answering")
qa_finetuned = pipeline(
    "question-answering", model="timpal0l/mdeberta-v3-base-squad2"
)
```
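
For reference, a QA pipeline is called with a `question` and a `context` and returns the highest-scoring extracted span. A minimal invocation sketch (the strings here are illustrative, not the notebook's actual inputs):

```python
# Illustrative inputs only; the notebook's actual question/context differ.
result = qa_finetuned(
    question="What pseudonym did the author write under?",
    context=(
        "Samuel Clemens, better known by the pseudonym Mark Twain, "
        "wrote The Adventures of Tom Sawyer."
    ),
)
# The pipeline returns a dict with the extracted span and its confidence:
# {'score': ..., 'start': ..., 'end': ..., 'answer': 'Mark Twain'}
print(result["answer"], result["score"])
```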

#### Observations

- **English:** Both models answered direct queries (e.g., the author’s pseudonym) accurately. For inference questions, the default model returned an action span, while mDeBERTa captured the implied effect more closely, though both remained extractive.

- **Arabic (MSA & Dialect):** The fine-tuned model excelled, responding correctly to subtle rephrasings in MSA and demonstrating orthographic sensitivity and accurate extraction in informal Egyptian contexts.

**Conclusion:** `mdeberta-v3-base-squad2` is robust across languages and dialects, albeit limited by extractive span selection rather than deep reasoning.

---

### 2. Named Entity Recognition (NER)

**Goal:** Identify `PERSON`, `ORG`, `LOC`, `DATE`, etc., evaluating general vs. specialized models.

```python
from transformers import pipeline

ner_default = pipeline("ner")
ner_multi = pipeline(
    "ner", model="Davlan/bert-base-multilingual-cased-ner-hrl"
)
ner_arabic = pipeline(
    "ner", model="CAMeL-Lab/bert-base-arabic-camelbert-mix-ner"
)
```
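
A minimal usage sketch (the sentence is illustrative): passing `aggregation_strategy="simple"` makes the pipeline merge subword tokens into whole entity spans, which is relevant to the boundary issues noted in the observations below.

```python
from transformers import pipeline

# Illustrative example; aggregation_strategy="simple" groups subword tokens
# into complete entity spans instead of returning raw wordpiece tags.
ner_grouped = pipeline(
    "ner",
    model="Davlan/bert-base-multilingual-cased-ner-hrl",
    aggregation_strategy="simple",
)
for ent in ner_grouped("Ahmed works at the United Nations office in Cairo."):
    print(ent["entity_group"], ent["word"], round(ent["score"], 3))
```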

#### Observations

- **English:** Both models detected common entities but missed dates and monetary values, likely because their CoNLL-style tag sets do not include `DATE` or `MONEY` labels. The default pipeline also mishandled long entity spans due to subword tokenization boundaries (see the grouped-entity sketch above).

- **Arabic (MSA):** CamelBERT accurately tagged organizations and locations, with improvements when context prepositions were explicit.

- **Arabic (Dialect):** CamelBERT excelled on colloquial text, correctly extracting names and places from highly informal sentences.

**Takeaway:** Language-specific NER models (e.g., CamelBERT) outperform general pipelines, especially for morphologically rich or dialectal text.

---

### 3. Fill‑Mask (Masked LM)

**Goal:** Predict missing tokens in a sequence, focusing on literal vs. idiomatic and dialectal completions.

```python
from transformers import pipeline

fm_en = pipeline("fill-mask", model="bert-base-uncased")
fm_multi = pipeline("fill-mask", model="xlm-roberta-base")
```
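
Worth noting: the two models use different mask tokens (`[MASK]` for BERT, `<mask>` for XLM-RoBERTa), so reading the token off each tokenizer avoids a mismatch. A small sketch with an illustrative sentence:

```python
# Each model defines its own mask token ([MASK] vs. <mask>), so build the
# prompt from tokenizer.mask_token rather than hard-coding it.
for fm in (fm_en, fm_multi):
    prompt = f"The cat sat on the {fm.tokenizer.mask_token}."
    for pred in fm(prompt)[:3]:  # top-3 completions
        print(pred["token_str"], round(pred["score"], 3))
```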

#### Observations

- **English:** BERT variants generated sensible completions ("mat", "floor", etc.) for straightforward sentences.

- **Arabic (MSA):** Most models struggled with idiomatic proverbs (expected "صديق", "friend"), though XLM-R offered one correct suggestion among its top predictions.

- **Arabic (Dialect):** XLM-R accurately completed colloquial sentences ("مدينة", "city"), highlighting strong pattern learning for common informal structures.

**Key Insight:** Models excel at literal patterns but falter on fixed idioms. Dialect predictions benefit from frequent informal patterns in pretraining data.

---

### 4. Text Generation

**Goal:** Generate coherent, contextually relevant text from a prompt, evaluating fluency, creativity, and resource demands.

```python
from transformers import pipeline

text_gen = pipeline("text-generation", model="gpt2")
text_gen_ar = pipeline("text-generation", model="LiquidAI/LFM2-1.2B")
```
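
Since the observations below refer to sampling parameters, here is a sketch of a tuned call (prompt and values are illustrative, not the notebook's exact settings):

```python
# Illustrative settings; the notebook's exact prompt and values may differ.
out = text_gen(
    "Once upon a time in Cairo,",
    max_new_tokens=50,
    do_sample=True,          # sample instead of greedy decoding
    temperature=0.8,         # <1.0 sharpens the token distribution
    top_k=50,                # restrict sampling to the 50 likeliest tokens
    repetition_penalty=1.2,  # discourage verbatim loops
)
print(out[0]["generated_text"])
```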
90
+
91
+ #### Observations
92
+
93
+ - **English:** GPT-2 and mGPT produce coherent short passages but require parameter tuning (`temperature`, `top_k`) to mitigate grammatical issues and repetition.
94
+
95
+ - **Arabic:** The 1.2B-parameter LFM2 model generated fluent, coherent Arabic but demanded substantial GPU memory, necessitating high‑capacity environments.
96
+
97
+
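
One standard way to tame the memory footprint (an assumption on my part, not necessarily what the notebook did) is half-precision loading plus automatic device placement via `accelerate`:

```python
import torch
from transformers import pipeline

# Assumed memory-saving setup, not confirmed from the notebook: bf16 weights
# roughly halve memory vs. float32, and device_map="auto" lets accelerate
# place layers on the GPU and spill the rest to CPU (requires `accelerate`).
text_gen_ar = pipeline(
    "text-generation",
    model="LiquidAI/LFM2-1.2B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```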
98
+ **Conclusion:** While smaller models are accessible for English, high-quality Arabic generation relies on larger models and robust infrastructure.
99
+
100
+ ---
101
+
102
+ ## âś… Final Summary
103
+
104
+ - **Question Answering:** Extractive models perform well on explicit facts; fine-tuned mDeBERTa handles dialects and phrasing nuances best.
105
+
106
+ - **NER:** Specialized models like CamelBERT are essential for accurate entity extraction in Arabic and dialects.
107
+
108
+ - **Fill‑Mask:** Literal tasks succeed broadly; idiomatic understanding requires further fine-tuning on proverb datasets.
109
+
110
+ - **Text Generation:** English generation is lightweight; Arabic generation is resource-intensive but yields strong outputs with dedicated models.
111
+
112
+
113
+
114
+ > **Key Insight:** Multilingual pipelines offer powerful baselines, but **task-specific and language-specific fine-tuning** is crucial for morphologically complex and dialect-rich languages.
115
+
116
+ ---
117
+
118
+ ## đź”­ Vision for Day 5
119
+
120
+ Tomorrow, we will expand our exploration to **Image and Audio** pipelines—such as object detection, image captioning, and automatic speech recognition—and potentially venture into **multimodal** and **graph-based** NLP tasks as resources permit.
notebooks/day4.ipynb ADDED
The diff for this file is too large to render.