# ✍️ **Day 03 – Deep Dive into Summarization & Translation with Hugging Face 🤗**

Today marks our exploration of two advanced generative NLP tasks beyond classification: **Text Summarization** and **Machine Translation**. We'll compare default Hugging Face pipelines against language-specific models, with an emphasis on Arabic.

---

## 📝 1. Text Summarization

### 1.1 Overview

- **Baseline:** Default `summarization` pipeline on an English narrative.
- **English-focused models:** Compare `facebook/bart-large-cnn`, `csebuetnlp/mT5_multilingual_XLSum`, and `Falconsai/text_summarization` with length parameters.
- **Arabic narrative:** Assess multilingual vs. Arabic-specialized models on Arabic text.

### 1.2 Experiment 1: Default Pipeline on English Narrative

```python
from transformers import pipeline

summarizer = pipeline("summarization")

# This is a short example
naruto_story = """
Born an orphan into the Hidden Leaf Village, Naruto Uzumaki's early life was shadowed by the terrifying Nine-Tailed Fox, a monstrous beast sealed within him."""

# Generate the summary
summary_default = summarizer(naruto_story)

# Print the result
print("--- Original Story ---")
print(naruto_story)
print("\n--- Default Summarizer Output ---")
print(summary_default[0]['summary_text'])
```

- **Model:** `sshleifer/distilbart-cnn-12-6` (default).
- **Input:** A Naruto Uzumaki story (with/without an initial title line).

**Key Observations:**

1. **Conciseness:** The summary distilled only the core arc (orphan → Hokage).
2. **Title Sensitivity:** With the title line present, the model labeled Naruto as “The Seventh Hokage” and omitted his name; removing the title restored “Naruto.”
3. **Omission of Details:** Side characters (Sasuke, Jiraiya, etc.) and subplots were dropped due to aggressive compression.

> **Insight:** Useful for quick overviews, but the default pipeline lacks narrative richness and requires parameter tuning or fine-tuned models for detail retention.

### 1.3 Experiment 2: Fine-Tuned English Models

#### 1.3.1 facebook/bart-large-cnn

- **Pros:** More verbose; includes “Naruto Uzumaki”.
- **Cons:** Hallucinated; misgendered Naruto as Kushina’s daughter.

#### 1.3.2 csebuetnlp/mT5_multilingual_XLSum

- **Issue:** Severe hallucinations; treated the narrative like news, fabricating details (e.g., Konoha setting, BBC reporter).

#### 1.3.3 Falconsai/text_summarization

```python
# Load the fine-tuned summarization model
summarizer = pipeline("summarization", model="Falconsai/text_summarization")

# Experiment with increased max_length to get more detail
summary_falconsai = summarizer(naruto_story, max_length=562, min_length=100, do_sample=False)

print("\n--- Fine-Tuned model on English Naruto Story ---")
print(summary_falconsai[0]['summary_text'])
```

- **Setup:** `max_length=562`, `min_length=100`, `do_sample=False`.
- **Performance:** Rich, coherent summary including multiple characters and plot points; minor truncation at the `max_length` cutoff.

> **Conclusion:** For English narrative, **Falconsai/text_summarization** offers the best balance of detail and accuracy.

### 1.4 Experiment 3: Arabic Narrative Summarization

- **Model:** `csebuetnlp/mT5_multilingual_XLSum` (with `min_length=100`).

**Findings:**

1. Hallucinations persisted; the model invented BBC Arabic interview segments.
2. Other Arabic or multilingual models similarly fabricated content.
3. English-tuned models produced garbled output on Arabic input.

> **Conclusion:** Off-the-shelf Arabic summarization models on Hugging Face are currently unreliable and prone to hallucination. Custom fine-tuning on Arabic narratives or larger Arabic LLMs may be required.

---

## 🌐 2. Machine Translation Deep Dive

### 2.1 Scope

- **Focus:** Translate between English ↔ Modern Standard Arabic (MSA) and Arabic dialects.
- **Models Tested:**
  1. `facebook/nllb-200-distilled-600M`
  2. `Helsinki-NLP/opus-mt-ar-en`
  3. `Helsinki-NLP/opus-mt-en-ar`
  4. `Helsinki-NLP/opus-mt-mul-en`

### 2.2 Experiment Results

| Model | MSA ↔ EN | Dialectal AR → EN | Notes |
|---|---|---|---|
| nllb-200-distilled-600M | Strong, fluent | Partial transliteration (“Yasta I am tired”) | Requires explicit language codes (see the sketch below). |
| opus-mt-ar-en | Good formal AR → EN | Struggled; literal or omitted slang | Tends toward brevity. |
| opus-mt-en-ar | Weak EN → AR | N/A | Incomplete outputs; unreliable. |
| opus-mt-mul-en | Good formal AR → EN | Poor on dialects | Multilingual training offers no advantage on dialects. |
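Since the translation calls themselves aren't reproduced above, here is a minimal sketch of how NLLB can be driven with explicit language codes through the standard `transformers` translation pipeline. The FLORES-200 codes (`arb_Arab` for Modern Standard Arabic, `eng_Latn` for English) are the documented ones for NLLB; the Arabic sentence below is a stand-in, not the exact text used in these experiments.

```python
from transformers import pipeline

# NLLB expects explicit FLORES-200 source/target codes
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="arb_Arab",   # Modern Standard Arabic
    tgt_lang="eng_Latn",   # English
)

# Stand-in MSA sentence ("I am very tired today and want to rest."), not the experiment text
msa_text = "أنا متعب جدًا اليوم وأريد أن أستريح."

result = translator(msa_text)
print(result[0]["translation_text"])
```

For dialectal input, the closest available FLORES-200 code (e.g., `arz_Arab` for Egyptian Arabic) can be substituted for `src_lang`. The Helsinki-NLP checkpoints fix the language pair in the model name (e.g., `Helsinki-NLP/opus-mt-ar-en`), so they take no language codes at all.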
> **Conclusion:** MSA translation is well-supported. Dialects remain a hurdle; **NLLB** shows promise via its recognition/transliteration of colloquialisms. Specialized fine-tuning or larger LLMs are needed for robust dialect handling.

---

### 🧠 Final Summary for Day 3

Today’s deep dive revealed both the capabilities and current limitations of open-source models when applied to Arabic-centric tasks:

📝 **Summarization:** English summaries are generally handled well—especially by models like `Falconsai/text_summarization`—producing coherent and detailed outputs. However, Arabic summarization continues to struggle with hallucinations and fragmented narratives, underscoring the need for Arabic-specific fine-tuning and better cultural grounding.

🌐 **Translation:** Modern Standard Arabic (MSA) is reasonably well-supported across several models. In contrast, Arabic dialects remain a major challenge, often yielding transliterations or contextually inaccurate translations. Among the tested models, `facebook/nllb-200-distilled-600M` showed the most potential, particularly when used with explicit language codes.

More broadly, these experiments highlight the ongoing hurdles posed by linguistic diversity, dialectal variation, and cultural nuance—even for advanced multilingual systems. This experience strengthens my motivation to keep learning and, ultimately, contribute to building more inclusive tools for Arabic-speaking communities. 🌍💡

---

### 🔭 Vision for Day 4

Tomorrow’s mission is to wrap up all text-focused pipelines, completing the core set of foundational NLP tasks before shifting gears into vision models.

📌 **Pipelines to Explore:**

1. **Question Answering**
   - Compare default vs. Arabic-optimized models (a starter sketch appears at the end of this post)
   - Test with both MSA and dialectal inputs
   - Evaluate performance on short vs. long contexts
2. **Named Entity Recognition (NER)**
   - Assess entity extraction accuracy in Arabic and English
   - Look for confusion or missed entities, especially with dialect-specific names or terms
3. **Fill-Mask**
   - Use models like `bert-base-multilingual-cased` and Arabic BERT variants
   - Observe predictions on varied inputs, including poetry, idioms, and slang
4. **Text Generation**
   - Experiment with `gpt2`, `mGPT`, and Arabic GPT models
   - Evaluate fluency, coherence, and hallucination tendencies

---

🔁 **Goal:** Continue comparing default models with fine-tuned alternatives.

💡 **Mindset:** We're not just running tests — we're mapping the current landscape of Arabic in open-source NLP.

🎯 **Outcome:** By the end of Day 4, we’ll have a comprehensive understanding of Hugging Face’s strengths and gaps in multilingual text processing.
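As a small head start on tomorrow's first item, here is a minimal sketch of the question-answering pipeline, assuming the default (English) checkpoint for the baseline run; the question/context pair is a placeholder, not one of the planned Day 4 test cases.

```python
from transformers import pipeline

# Default QA model as the baseline; Arabic-optimized checkpoints can be swapped in via `model=`
qa = pipeline("question-answering")

# Placeholder example, not a Day 4 test case
result = qa(
    question="Which village is Naruto from?",
    context="Naruto Uzumaki grew up as an orphan in the Hidden Leaf Village.",
)
print(result["answer"], round(result["score"], 3))
```

Comparing the default against Arabic-optimized models is then just a matter of passing a different `model=` when building the pipeline, which should keep the Day 4 comparisons lightweight.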