# 🌐 T5-Based Multilingual Text Translator

This repository presents a fine-tuned T5-small model for multilingual text translation across English, French, German, Italian, and Portuguese. It includes quantization for efficient inference and speech synthesis support for accessibility.

---

## 📝 Problem Statement

The goal is to translate text between English and multiple European languages using a transformer-based model. Instead of relying on black-box APIs, this project fine-tunes the T5 model on parallel multilingual corpora, enabling offline translation and potential customization.

---

## 📊 Dataset

- **Source:** Custom parallel corpus (`.txt` files) with one-to-one sentence alignments.
- **Languages Supported:**
  - English
  - French
  - German
  - Italian
  - Portuguese
- **Structure:**
  - Each language has a corresponding `.txt` file.
  - Lines are aligned by index to form translation pairs.
- **Example Input Format:**

  ```
  Source: translate English to French: I am a student.
  Target: Je suis un étudiant.
  ```

---

## 🧠 Model Details

- **Architecture:** T5-small
- **Tokenizer:** `T5Tokenizer`
- **Model:** `T5ForConditionalGeneration`
- **Task Type:** Sequence-to-sequence translation (supervised fine-tuning)

---

## 🔧 Installation

```bash
pip install transformers datasets torch gtts
```

---

## 🚀 Loading the Model

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch

# Load quantized model (float16)
model = T5ForConditionalGeneration.from_pretrained("quantized_model", torch_dtype=torch.float16)
tokenizer = T5Tokenizer.from_pretrained("quantized_model")

# Translation example
source = "translate English to German: How are you?"
inputs = tokenizer(source, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model.generate(**inputs)
print("Translated:", tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## 📈 Performance Metrics

Because the model was fine-tuned for only a single epoch, performance metrics were not computed for this release. For a production-level system, BLEU or ROUGE scores should be evaluated on a held-out test set; an example BLEU evaluation sketch is included after the fine-tuning details below.

---

## 🏋️ Fine-Tuning Details

### 📚 Dataset Preparation

- A total of 5 text files (`english.txt`, `french.txt`, etc.).
- Each sentence is aligned by index to form parallel translation pairs.

### 🔧 Training Configuration

- **Epochs:** 1
- **Batch size:** 4
- **Max sequence length:** 128
- **Model base:** `t5-small`
- **Framework:** Hugging Face Transformers + PyTorch
- **Evaluation strategy:** 10% test split
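The training script (`multilingual_translator.py`) is not reproduced in this README, so the sketch below only illustrates how the configuration above could be implemented with Hugging Face Transformers and plain PyTorch. The file names, task prefix, optimizer, and learning rate are assumptions taken from the description above, not the actual script.

```python
import torch
from torch.utils.data import DataLoader, Dataset, random_split
from transformers import T5ForConditionalGeneration, T5Tokenizer


class ParallelTextDataset(Dataset):
    """Pairs index-aligned lines from two parallel text files into T5 examples."""

    def __init__(self, src_file, tgt_file, prefix, tokenizer, max_length=128):
        with open(src_file, encoding="utf-8") as f:
            src_lines = f.read().splitlines()
        with open(tgt_file, encoding="utf-8") as f:
            tgt_lines = f.read().splitlines()
        self.pairs = list(zip(src_lines, tgt_lines))  # aligned by line index
        self.prefix = prefix
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        src, tgt = self.pairs[idx]
        enc = self.tokenizer(self.prefix + src, max_length=self.max_length,
                             padding="max_length", truncation=True, return_tensors="pt")
        dec = self.tokenizer(tgt, max_length=self.max_length,
                             padding="max_length", truncation=True, return_tensors="pt")
        labels = dec.input_ids.squeeze(0)
        labels[labels == self.tokenizer.pad_token_id] = -100  # ignore padding in the loss
        return {"input_ids": enc.input_ids.squeeze(0),
                "attention_mask": enc.attention_mask.squeeze(0),
                "labels": labels}


tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Assumed file names and task prefix, following the dataset description above.
# The other language pairs follow the same pattern.
dataset = ParallelTextDataset("english.txt", "french.txt",
                              "translate English to French: ", tokenizer)

# 10% held-out test split
test_size = max(1, int(0.1 * len(dataset)))
train_set, test_set = random_split(dataset, [len(dataset) - test_size, test_size])
train_loader = DataLoader(train_set, batch_size=4, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
model.train()
for batch in train_loader:  # a single epoch, as in the configuration above
    optimizer.zero_grad()
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()

model.save_pretrained("model")
tokenizer.save_pretrained("model")
```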
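If BLEU scores are needed, `sacrebleu` (not part of the installation command above, so an optional extra) can score generated translations against reference translations from the held-out split. The pairs below are placeholders for illustration only.

```python
import sacrebleu  # pip install sacrebleu
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("model").eval()
tokenizer = T5Tokenizer.from_pretrained("model")

# Placeholder held-out pairs: (source sentence, reference translation).
# In practice these would come from the 10% test split.
test_pairs = [
    ("translate English to French: I am a student.", "Je suis un étudiant."),
    ("translate English to French: Good morning.", "Bonjour."),
]

hypotheses, references = [], []
for source, reference in test_pairs:
    inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        output = model.generate(**inputs, max_length=128)
    hypotheses.append(tokenizer.decode(output[0], skip_special_tokens=True))
    references.append(reference)

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")
```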
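---

## 🔊 Speech Synthesis

The introduction mentions speech-synthesis support for accessibility and the installation command includes `gtts`, but the README does not show how it is wired in. The snippet below is a minimal, hypothetical example of speaking a translated sentence with the public gTTS API; it requires internet access because gTTS calls Google's text-to-speech service.

```python
from gtts import gTTS

# Hypothetical example: synthesize a translated French sentence to an MP3 file.
translated_text = "Je suis un étudiant."
tts = gTTS(text=translated_text, lang="fr")  # language codes: en, fr, de, it, pt
tts.save("translation_fr.mp3")
print("Saved speech to translation_fr.mp3")
```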
---

## 🔄 Quantization

Post-training quantization was performed by converting the weights to half precision (FP16) with `.half()`, reducing model size and improving inference speed.

```python
from transformers import T5ForConditionalGeneration

# Load full-precision model
model_fp32 = T5ForConditionalGeneration.from_pretrained("model")

# Convert to half precision and save
model_fp16 = model_fp32.half()
model_fp16.save_pretrained("quantized_model")
```

**Model Size Comparison:**

| Type             | Size (KB) |
|------------------|-----------|
| FP32 (original)  | ~6,904    |
| FP16 (quantized) | ~3,452    |

---

## 📁 Repository Structure

```
.
├── model/                        # FP32 model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   └── ...
├── quantized_model/              # FP16 quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   └── ...
├── README.md                     # Documentation
└── multilingual_translator.py    # Training and inference script
```

---

## ⚠️ Limitations

- Trained on a small dataset for only one epoch, so it may not generalize well to unseen phrases or complex sentences.
- Language coverage is limited to the 5 predefined languages.
- gTTS depends on Google's text-to-speech API and requires internet access.

---

## 🤝 Contributing

Feel free to submit issues or PRs to add more language pairs, extend training, or integrate a UI for real-time use.