Spaces: AmineDubs / Scripts_translation_to_arabic

amine_dubs committed on Apr 28

Commit dbe4e2f

1 Parent(s): 20ee4d2

Enhanced prompt engineering with cultural sensitivity and multi-language support

Files changed (3)

backend/main.py +75 -39
project_details.txt +16 -0
project_report.md +98 -33

backend/main.py CHANGED

@@ -5,7 +5,9 @@ from fastapi.templating import Jinja2Templates
 from typing import List, Optional
 import shutil
 import os
-from transformers import pipeline, MarianMTModel, MarianTokenizer
 import traceback # Ensure traceback is imported
 # --- Configuration ---
@@ -27,62 +29,96 @@ app.mount("/static", StaticFiles(directory=STATIC_DIR), name="static")
 # Ensure the templates directory exists (FastAPI doesn't create it)
 templates = Jinja2Templates(directory=TEMPLATE_DIR)
-# --- Placeholder for Model Loading ---
-# Initialize the translation pipeline (load the model)
-# Consider loading the model on startup to avoid delays during requests
-# Define model name
-MODEL_NAME = "Helsinki-NLP/opus-mt-en-ar"
 CACHE_DIR = "/app/.cache" # Explicitly define cache directory
-translator = None # Initialize translator as None
 try:
-    print("--- Loading Model ---") # Add a clear marker
-    print(f"Loading tokenizer for {MODEL_NAME} using MarianTokenizer...")
-    # Use MarianTokenizer directly and specify cache_dir
-    tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME, cache_dir=CACHE_DIR)
-    print(f"Loading model for {MODEL_NAME} using MarianMTModel...")
-    # Use MarianMTModel directly and specify cache_dir
-    model = MarianMTModel.from_pretrained(MODEL_NAME, cache_dir=CACHE_DIR)
-    print(f"Initializing translation pipeline for {MODEL_NAME}...")
-    # Pass the loaded objects to the pipeline
-    translator = pipeline("translation", model=model, tokenizer=tokenizer)
     print("--- Model Loaded Successfully ---")
 except Exception as e:
     print(f"--- ERROR Loading Model ---")
     print(f"Error loading model or tokenizer {MODEL_NAME}: {e}")
     traceback.print_exc() # Print full traceback for loading error
-    # Keep translator as None
 # --- Helper Functions ---
 def translate_text_internal(text: str, source_lang: str, target_lang: str = "ar") -> str:
-    """Internal function to handle text translation using the loaded model."""
-    if translator is None:
-        # If the model failed to load, raise an error instead of returning a placeholder
         raise HTTPException(status_code=503, detail="Translation service is unavailable (model not loaded).")
-    # Log the request details
-    print(f"Translation Request - Source Lang: {source_lang}, Target Lang: {target_lang}")
-    print(f"Input Text: {text}")
-    # --- Actual Translation Logic (using Hugging Face pipeline) ---
     try:
-        # The Helsinki model expects the text directly
-        result = translator(text)
-        if result and isinstance(result, list) and 'translation_text' in result[0]:
-            translated_text = result[0]['translation_text']
-            print(f"Raw Translation Output: {translated_text}")
-            # Return the actual translated text
-            return translated_text
-        else:
-            print(f"Unexpected translation result format: {result}")
-            raise HTTPException(status_code=500, detail="Translation failed: Unexpected model output format.")
     except Exception as e:
-        print(f"Error during translation pipeline: {e}")
         traceback.print_exc()
-        raise HTTPException(status_code=500, detail=f"Translation failed: {e}")
 # --- Function to extract text ---
 async def extract_text_from_file(file: UploadFile) -> str:

 from typing import List, Optional
 import shutil
 import os
+# Use AutoModel for flexibility
+from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+import torch # Ensure torch is imported if using generate directly
 import traceback # Ensure traceback is imported
 # --- Configuration ---
 # Ensure the templates directory exists (FastAPI doesn't create it)
 templates = Jinja2Templates(directory=TEMPLATE_DIR)
+# --- Model Loading ---
+# Define model name - Switched to FLAN-T5
+MODEL_NAME = "google/flan-t5-small"
 CACHE_DIR = "/app/.cache" # Explicitly define cache directory
+model = None
+tokenizer = None
 try:
+    print("--- Loading Model ---")
+    print(f"Loading tokenizer for {MODEL_NAME} using AutoTokenizer...")
+    # Use AutoTokenizer and specify cache_dir
+    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, cache_dir=CACHE_DIR)
+    print(f"Loading model for {MODEL_NAME} using AutoModelForSeq2SeqLM...")
+    # Use AutoModelForSeq2SeqLM and specify cache_dir
+    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME, cache_dir=CACHE_DIR)
     print("--- Model Loaded Successfully ---")
 except Exception as e:
     print(f"--- ERROR Loading Model ---")
     print(f"Error loading model or tokenizer {MODEL_NAME}: {e}")
     traceback.print_exc() # Print full traceback for loading error
+    # Keep model and tokenizer as None
 # --- Helper Functions ---
 def translate_text_internal(text: str, source_lang: str, target_lang: str = "ar") -> str:
+    """Internal function to handle text translation using the loaded model via prompting."""
+    if model is None or tokenizer is None:
+        # If the model/tokenizer failed to load, raise an error
         raise HTTPException(status_code=503, detail="Translation service is unavailable (model not loaded).")
+    # --- Enhanced Prompt Engineering ---
+    # Map source language codes to full language names for better model understanding
+    language_map = {
+        "en": "English",
+        "fr": "French",
+        "es": "Spanish",
+        "de": "German",
+        "zh": "Chinese",
+        "ru": "Russian",
+        "ja": "Japanese",
+        "hi": "Hindi",
+        "pt": "Portuguese",
+        "tr": "Turkish",
+        "ko": "Korean",
+        "it": "Italian"
+        # Add more languages as needed
+    }
+    # Get the full language name, or use the code if not in our map
+    source_lang_name = language_map.get(source_lang, source_lang)
+    # Craft a more detailed prompt that emphasizes meaning over literal translation
+    # and focuses on eloquence and cultural sensitivity
+    prompt = f"""Translate the following {source_lang_name} text into Modern Standard Arabic (Fusha).
+Focus on conveying the meaning elegantly using proper Balagha (Arabic eloquence).
+Adapt any cultural references or idioms appropriately rather than translating literally.
+Ensure the translation reads naturally to a native Arabic speaker.
+Text to translate:
+{text}"""
+    print(f"Translation Request - Source Lang: {source_lang} ({source_lang_name}), Target Lang: {target_lang}")
+    print(f"Using Enhanced Prompt for Balagha and Cultural Sensitivity")
+    # --- Actual Translation Logic (using model.generate) ---
     try:
+        # Tokenize the prompt
+        inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True)
+        # Generate the translation with parameters tuned for quality
+        outputs = model.generate(
+            **inputs,
+            max_length=512,  # Adjust based on expected output length
+            num_beams=5,     # Increased for better quality
+            length_penalty=1.0, # Library default; applied when beam search scores hypotheses
+            top_k=50,        # Only takes effect when do_sample=True; ignored by plain beam search
+            top_p=0.95,      # Only takes effect when do_sample=True; ignored by plain beam search
+            early_stopping=True
+        )
+        # Decode the generated tokens
+        translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
+        print(f"Raw Translation Output: {translated_text}")
+        return translated_text
     except Exception as e:
+        print(f"Error during model generation: {e}")
         traceback.print_exc()
+        raise HTTPException(status_code=500, detail=f"Translation failed during generation: {e}")
 # --- Function to extract text ---
 async def extract_text_from_file(file: UploadFile) -> str:
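
Because `translate_text_internal` tokenizes the prompt with `truncation=True` and the source text sits at the end of the prompt, anything beyond the tokenizer's maximum input length is cut off before generation. One possible way to handle long documents is to split the extracted text into paragraph-based chunks and translate each chunk separately. The following is a minimal sketch under assumptions, not part of this commit: `split_into_chunks` is a hypothetical helper, and the character budget is an illustrative stand-in for a real token-count budget.

```python
def split_into_chunks(text: str, max_chars: int = 1000) -> list[str]:
    """Group paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # Skip empty paragraphs produced by extra blank lines
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            # Either the paragraph still fits, or it starts a new chunk.
            current = candidate
        else:
            # Budget exceeded: close the current chunk and start a new one.
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed to `translate_text_internal` in turn and the results joined with blank lines, at the cost of the model seeing no context across chunk boundaries.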

project_details.txt CHANGED

@@ -2,6 +2,21 @@
 This guide outlines the steps to deploy the AI Translator web application to Hugging Face (HF) Spaces using Docker.
 ## Prerequisites
 1.  **Docker:** Ensure Docker Desktop (or Docker Engine on Linux) is installed and running on your local machine.
@@ -48,6 +63,7 @@ This guide outlines the steps to deploy the AI Translator web application to Hug
         # AI Translator
         This Space hosts an AI-powered web application for translating text and documents to/from Arabic.
         Built with FastAPI, Docker, and Hugging Face Transformers.
         ```
         *   **Important:** Ensure `app_port` matches the port exposed in your `backend/Dockerfile` (which is `8000` in the current setup).

 This guide outlines the steps to deploy the AI Translator web application to Hugging Face (HF) Spaces using Docker.
+## Application Features
+1. **Eloquent Arabic Translation:** The application focuses on producing high-quality Arabic translations that prioritize meaning and eloquence (Balagha) over literal translations.
+2. **Cultural Sensitivity:** Translations adapt cultural references and idioms appropriately for the target audience.
+3. **Multi-Language Support:** Translation from 12 languages (English, French, Spanish, German, Chinese, Russian, Japanese, Hindi, Portuguese, Turkish, Korean, Italian) to Modern Standard Arabic.
+4. **Document Processing:** Support for translating text from various document formats (PDF, DOCX, TXT).
+5. **Advanced Prompt Engineering:** Uses carefully designed prompts with the FLAN-T5 model to achieve eloquent, culturally-aware translations.
+## Translation Model Details
+* **Model:** `google/flan-t5-small` - An instruction-tuned language model capable of following specific translation directions
+* **Prompt Approach:** Uses explicit instructions to guide the model toward eloquent Arabic (Balagha) and cultural adaptation
+* **Generation Parameters:** Beam search (`num_beams=5`, `early_stopping=True`) with `length_penalty=1.0` and `max_length=512`; `top_k` and `top_p` are also passed, but they only apply when sampling is enabled
+* **Scalability:** The small model variant balances quality with reasonable resource requirements for deployment
 ## Prerequisites
 1.  **Docker:** Ensure Docker Desktop (or Docker Engine on Linux) is installed and running on your local machine.
         # AI Translator
         This Space hosts an AI-powered web application for translating text and documents to/from Arabic.
+        The goal is to provide accurate and fluent translations that also respect cultural nuances and differences.
         Built with FastAPI, Docker, and Hugging Face Transformers.
         ```
         *   **Important:** Ensure `app_port` matches the port exposed in your `backend/Dockerfile` (which is `8000` in the current setup).

project_report.md CHANGED

@@ -64,7 +64,7 @@ This report details the development process of an AI-powered web application des
     *   **Description:** Uploads a document, extracts its text, and translates it.
     *   **Request Body (Multipart Form Data):**
         *   `file` (UploadFile): The document file (.pdf, .docx, .xlsx, .pptx, .txt).
-        *   `source_lang` (str): The source language code.
         *   `target_lang` (str): The target language code (currently fixed to 'ar').
     *   **Response (`JSONResponse`):**
         *   `original_filename` (str): The name of the uploaded file.
@@ -101,46 +101,106 @@ Key Python libraries used:
 7.  **Document Backend Processing:** FastAPI receives the file, saves it temporarily, extracts text using appropriate libraries (PyMuPDF, python-docx, etc.), calls the internal translation function, cleans up the temporary file, and returns the result.
 8.  **Response Handling:** Frontend JS receives the JSON response and updates the UI to display the translation or an error message.
-## 4. Prompt Engineering and Optimization
-### 4.1. Initial Prompt Design
-The core requirement is to translate *from* a source language *to* Arabic (MSA Fusha) with a focus on meaning and eloquence (Balagha), avoiding overly literal translations.
-The initial prompt structure designed for the `translate_text_internal` function is:
 ```
-Translate the following text from {source_lang} to Arabic (Modern Standard Arabic - Fusha) precisely. Do not provide a literal translation; focus on conveying the meaning accurately while respecting Arabic eloquence (balagha) by rephrasing if necessary:
-{text}
 ```
-### 4.2. Rationale
-*   **Explicit Target:** Specifies "Arabic (Modern Standard Arabic - Fusha)" to guide the model towards the desired dialect and register.
-*   **Precision Instruction:** "precisely" encourages accuracy.
-*   **Constraint against Literal Translation:** "Do not provide a literal translation" directly addresses a potential pitfall.
-*   **Focus on Meaning:** "focus on conveying the meaning accurately" sets the primary goal.
-*   **Eloquence (Balagha):** "respecting Arabic eloquence (balagha)" introduces the key stylistic requirement.
-*   **Mechanism:** "by rephrasing if necessary" suggests *how* to achieve non-literal translation and eloquence.
-*   **Clear Input:** `{text}` placeholder clearly separates the instruction from the input text.
-*   **Source Language Context:** `{source_lang}` provides context, which can be crucial for disambiguation.
-### 4.3. Testing and Refinement (Planned/Hypothetical)
-*(This section would be filled in after actual model integration and testing)*
-*   **Model Selection:** The choice of model (e.g., a fine-tuned NLLB model, AraT5, or a large multilingual model like Qwen or Llama adapted for translation) will significantly impact performance. Initial tests would involve selecting a candidate model from Hugging Face Hub known for strong multilingual or English-Arabic capabilities.
-*   **Baseline Test:** Translate sample sentences/paragraphs using the initial prompt and evaluate the output quality based on accuracy, fluency, and adherence to Balagha principles.
-*   **Prompt Variations:**
-    *   *Simpler Prompts:* Test shorter prompts (e.g., "Translate to eloquent MSA Arabic: {text}") to see if the model can infer the constraints.
-    *   *More Explicit Examples (Few-Shot):* If needed, add examples within the prompt (though this increases complexity and token count): "Translate ... Example: 'Hello world' -> 'مرحباً بالعالم' (eloquent). Input: {text}"
-    *   *Emphasis:* Use different phrasing or emphasis (e.g., "Prioritize conveying the core meaning over word-for-word translation.")
-*   **Parameter Tuning:** Experiment with model generation parameters (e.g., `temperature`, `top_k`, `num_beams` if using beam search) available through the `transformers` pipeline or `generate` method to influence output style and creativity.
-*   **Evaluation Metrics:**
-    *   *Human Evaluation:* Subjective assessment by Arabic speakers focusing on accuracy, naturalness, and eloquence.
-    *   *Automated Metrics (with caution):* BLEU, METEOR scores against reference translations (if available), primarily for tracking relative improvements during iteration, acknowledging their limitations for stylistic nuances like Balagha.
-*   **Final Prompt Justification:** Based on the tests, the prompt that consistently produces the best balance of accurate meaning and desired Arabic style will be chosen. The current prompt is a strong starting point based on explicitly stating all requirements.
 ## 5. Frontend Design and User Experience
@@ -222,6 +282,11 @@ Translate the following text from {source_lang} to Arabic (Modern Standard Arabi
 *   **Add More Document Types:** Support additional formats if required.
 *   **Testing:** Implement unit and integration tests for backend logic.
 ## 8. Conclusion
 This project successfully lays the foundation for an AI-powered translation web service focusing on high-quality Arabic translation. The FastAPI backend provides a robust API, and the frontend offers a simple interface for text and document translation. Dockerization ensures portability and simplifies deployment to platforms like Hugging Face Spaces. Key next steps involve integrating a suitable translation model and refining the prompt engineering based on real-world testing.

     *   **Description:** Uploads a document, extracts its text, and translates it.
     *   **Request Body (Multipart Form Data):**
         *   `file` (UploadFile): The document file (.pdf, .docx, .xlsx, .pptx, .txt).
+        *   `source_lang` (str):
         *   `target_lang` (str): The target language code (currently fixed to 'ar').
     *   **Response (`JSONResponse`):**
         *   `original_filename` (str): The name of the uploaded file.
 7.  **Document Backend Processing:** FastAPI receives the file, saves it temporarily, extracts text using appropriate libraries (PyMuPDF, python-docx, etc.), calls the internal translation function, cleans up the temporary file, and returns the result.
 8.  **Response Handling:** Frontend JS receives the JSON response and updates the UI to display the translation or an error message.
+## 4. Prompt Engineering and Translation Quality Control
+### 4.1. Desired Translation Characteristics
+The core requirement is to translate *from* a source language *to* Arabic (MSA Fusha) with a focus on meaning and eloquence (Balagha), avoiding overly literal translations. These goals typically fall under the umbrella of prompt engineering when using general large language models.
+### 4.2. Approach with Instruction-Tuned LLM (FLAN-T5)
+Due to persistent loading issues with the specialized `Helsinki-NLP` model and the desire to have more direct control over the translation process, the project switched to using `google/flan-t5-small`, an instruction-tuned language model.
+#### 4.2.1 Explicit Prompt Engineering
+The translation process uses carefully crafted prompts to guide the model toward high-quality Arabic translations. The `translate_text_internal` function in `main.py` constructs an enhanced prompt with the following components:
+```python
+prompt = f"""Translate the following {source_lang_name} text into Modern Standard Arabic (Fusha).
+Focus on conveying the meaning elegantly using proper Balagha (Arabic eloquence).
+Adapt any cultural references or idioms appropriately rather than translating literally.
+Ensure the translation reads naturally to a native Arabic speaker.
+Text to translate:
+{text}"""
 ```
+This prompt explicitly instructs the model to:
+- Use Modern Standard Arabic (Fusha) as the target language register
+- Emphasize eloquence (Balagha) in the translation style
+- Handle cultural references and idioms appropriately for an Arabic audience
+- Prioritize natural-sounding output over literal translation
+#### 4.2.2 Multi-Language Support
+The system supports multiple source languages through a language map that converts ISO 639-1 language codes to full language names for better model comprehension:
+```python
+language_map = {
+    "en": "English",
+    "fr": "French",
+    "es": "Spanish",
+    "de": "German",
+    "zh": "Chinese",
+    "ru": "Russian",
+    "ja": "Japanese",
+    "hi": "Hindi",
+    "pt": "Portuguese",
+    "tr": "Turkish",
+    "ko": "Korean",
+    "it": "Italian"
+    # Additional languages can be added as needed
+}
 ```
+Using full language names in the prompt (e.g., "Translate the following French text...") is intended to give the model a clearer task description than a bare language code would.
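
The mapping and the prompt template can be combined into one small helper. This is a minimal sketch rather than the project's actual code: `build_translation_prompt` is a hypothetical name, and the map is abbreviated to three entries for illustration. The fallback to the raw code mirrors the `language_map.get(source_lang, source_lang)` call in `backend/main.py`.

```python
# Abbreviated copy of the language map; the full version lists 12 languages.
LANGUAGE_MAP = {"en": "English", "fr": "French", "es": "Spanish"}


def build_translation_prompt(text: str, source_lang: str) -> str:
    """Return the instruction prompt for one piece of source text."""
    # Unknown codes fall back to the code itself, as in translate_text_internal.
    source_lang_name = LANGUAGE_MAP.get(source_lang, source_lang)
    return (
        f"Translate the following {source_lang_name} text into Modern Standard Arabic (Fusha).\n"
        "Focus on conveying the meaning elegantly using proper Balagha (Arabic eloquence).\n"
        "Adapt any cultural references or idioms appropriately rather than translating literally.\n"
        "Ensure the translation reads naturally to a native Arabic speaker.\n"
        "Text to translate:\n"
        f"{text}"
    )
```

Keeping prompt construction in its own function makes it straightforward to test prompt variants without loading the model.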
+#### 4.2.3 Generation Parameter Optimization
+To further improve translation quality, the model's generation parameters have been tuned:
+```python
+outputs = model.generate(
+    **inputs,
+    max_length=512,     # Sufficient length for most translations
+    num_beams=5,        # Wider beam search for better quality
+    length_penalty=1.0, # Library default; applied when beam search scores hypotheses
+    top_k=50,           # Only takes effect when do_sample=True
+    top_p=0.95,         # Only takes effect when do_sample=True
+    early_stopping=True
+)
+```
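+These parameters are intended to encourage:
+- More natural-sounding translations through beam search (`num_beams=5`, `early_stopping=True`)
+- Better handling of nuanced expressions
+- Appropriate length for preserving meaning (`max_length=512`; `length_penalty=1.0` is the library default)
+- Balance between creativity and accuracy, although `top_k` and `top_p` only take effect when `do_sample=True`, which this call does not set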
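
In the `transformers` `generate` API, beam search and sampling are separate decoding modes, and `top_k` / `top_p` apply only to sampling. A hedged sketch of keeping the two configurations apart is shown below; `generation_kwargs` is a hypothetical helper, the values are copied from the snippet in this section, and the resulting dictionary is assumed to be passed as `model.generate(**inputs, **kwargs)`.

```python
def generation_kwargs(sample: bool = False) -> dict:
    """Build keyword arguments for model.generate (hypothetical helper)."""
    kwargs = {"max_length": 512}
    if sample:
        # Sampling mode: top_k/top_p filter the candidate tokens at each step.
        kwargs.update(do_sample=True, top_k=50, top_p=0.95)
    else:
        # Deterministic beam search: top_k/top_p would have no effect here.
        kwargs.update(num_beams=5, length_penalty=1.0, early_stopping=True)
    return kwargs


beam_kwargs = generation_kwargs()
sampling_kwargs = generation_kwargs(sample=True)
```

Separating the modes makes it explicit which parameters are active, which helps when comparing outputs during prompt and parameter iteration.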
+### 4.3. Testing and Refinement Process
+*   **Prompt Iteration:** The core refinement process involves testing different prompt phrasings with various text samples across supported languages. Each iteration aims to improve the model's understanding of:
+    - What constitutes eloquent Arabic (Balagha)
+    - How to properly adapt culturally-specific references
+    - When to prioritize meaning over literal translation
+*   **Cultural Sensitivity Testing:** Sample texts containing culturally-specific references, idioms, and metaphors from each supported language are used to evaluate how well the model adapts these elements for an Arabic audience.
+*   **Evaluation Metrics:**
+    *   *Human Evaluation:* Native Arabic speakers assess translations for:
+        - Eloquence (Balagha): Does the translation use appropriately eloquent Arabic?
+        - Cultural Adaptation: Are cultural references appropriately handled?
+        - Naturalness: Does the text sound natural to native speakers?
+        - Accuracy: Is the meaning preserved despite non-literal translation?
+    *   *Automated Metrics:* While useful as supplementary measures, metrics like BLEU are used with caution as they tend to favor more literal translations.
+*   **Model Limitations:** The current implementation with FLAN-T5-small shows promise but has limitations:
+    - It may struggle with very specialized technical content
+    - Some cultural nuances from less common language pairs may be missed
+    - Longer texts may lose coherence across paragraphs
+    Future work may explore larger model variants if these limitations prove significant.
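
To illustrate why n-gram metrics such as BLEU tend to reward literal translations, the sketch below computes clipped n-gram precision, the core ingredient of BLEU, without the brevity penalty or the averaging over n-gram orders. The function name and the example sentences are illustrative assumptions (English is used for readability), not project data or results.

```python
from collections import Counter


def ngram_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Clipped n-gram precision of a candidate against one reference."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram counts at most as often as it occurs in the reference.
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return matched / total


reference = "it is raining very heavily"
literal = "it is raining very hard"
idiomatic = "the rain is pouring down"

literal_score = ngram_precision(literal, reference)      # 4 of 5 words match
idiomatic_score = ngram_precision(idiomatic, reference)  # only "is" matches
```

Both candidates convey the reference's meaning, yet the one that shares more surface wording scores higher, which is why human evaluation remains the primary measure for Balagha and cultural adaptation.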
 ## 5. Frontend Design and User Experience
 *   **Add More Document Types:** Support additional formats if required.
 *   **Testing:** Implement unit and integration tests for backend logic.
+## Project Log / Updates
+*   **2025-04-28:** Updated project requirements to explicitly include the need for the translation model to respect cultural differences and nuances in its output.
+*   **2025-04-28:** Switched translation model from `Helsinki-NLP/opus-mt-en-ar` to `google/flan-t5-small` due to persistent loading errors in the deployment environment and to enable direct prompt engineering for translation tasks.
 ## 8. Conclusion
 This project successfully lays the foundation for an AI-powered translation web service focusing on high-quality Arabic translation. The FastAPI backend provides a robust API, and the frontend offers a simple interface for text and document translation. Dockerization ensures portability and simplifies deployment to platforms like Hugging Face Spaces. Key next steps involve integrating a suitable translation model and refining the prompt engineering based on real-world testing.