
🌐 T5-Based Multilingual Text Translator

This repository presents a fine-tuned T5-small model for multilingual text translation across English, French, German, Italian, and Portuguese. It includes an FP16-quantized variant for efficient inference and gTTS-based speech synthesis support for accessibility.


πŸ“ Problem Statement

The goal is to translate text between English and multiple European languages using a transformer-based model. Instead of using black-box APIs, this project fine-tunes the T5 model on parallel multilingual corpora, enabling offline translation and potential customization.


📊 Dataset

  • Source: Custom parallel corpus (.txt files) with one-to-one sentence alignments.

  • Languages Supported:

    • English
    • French
    • German
    • Italian
    • Portuguese
  • Structure:

    • Each language has a corresponding .txt file.
    • Lines are aligned by index to form translation pairs.
  • Example Input Format (a data-loading sketch follows this list):

    Source: translate English to French: I am a student.
    Target: Je suis un étudiant.
    
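
Given this structure, a minimal sketch of turning two of the aligned files into T5-style source/target pairs might look like the following. The file names match the Dataset Preparation section below; the UTF-8 encoding and the helper name load_pairs are assumptions.

def load_pairs(src_path="english.txt", tgt_path="french.txt",
               prefix="translate English to French: "):
    # Read both files; lines are aligned by index, so zip() forms the translation pairs
    with open(src_path, encoding="utf-8") as f_src, open(tgt_path, encoding="utf-8") as f_tgt:
        src_lines = [line.strip() for line in f_src]
        tgt_lines = [line.strip() for line in f_tgt]
    return [(prefix + s, t) for s, t in zip(src_lines, tgt_lines) if s and t]

pairs = load_pairs()
print(pairs[0])  # ("translate English to French: I am a student.", "Je suis un étudiant.")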

🧠 Model Details

  • Architecture: T5-small
  • Tokenizer: T5Tokenizer
  • Model: T5ForConditionalGeneration
  • Task Type: Sequence-to-Sequence Translation (Supervised Fine-tuning)

🔧 Installation

pip install transformers datasets torch gtts sentencepiece

🚀 Loading the Model

from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch

# Load quantized model (float16)
model = T5ForConditionalGeneration.from_pretrained("quantized_model", torch_dtype=torch.float16)
tokenizer = T5Tokenizer.from_pretrained("quantized_model")

# Translation example
source = "translate English to German: How are you?"
inputs = tokenizer(source, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    outputs = model.generate(**inputs, max_length=128)  # match the training max sequence length

print("Translated:", tokenizer.decode(outputs[0], skip_special_tokens=True))

📈 Performance Metrics

Because the model was fine-tuned for only a single epoch, translation quality metrics were not computed for this release. For a production-level system, BLEU or ROUGE scores should be evaluated on a held-out test set.
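
As a sketch of how that evaluation could be wired up, sacreBLEU can score model outputs against held-out references. Note that sacrebleu is not in the install list above (pip install sacrebleu first), and the sentences here are placeholders.

import sacrebleu

# Hypotheses are model translations on the test split;
# references are the aligned gold sentences (placeholder values shown here)
hypotheses = ["Je suis un étudiant."]
references = ["Je suis étudiant."]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")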


πŸ‹οΈ Fine-Tuning Details

📚 Dataset Preparation

  • Five parallel text files (english.txt, french.txt, etc.), one per language.
  • Sentences are aligned by index to form parallel translation pairs.

🔧 Training Configuration

  • Epochs: 1
  • Batch size: 4
  • Max sequence length: 128
  • Model base: t5-small
  • Framework: Hugging Face Transformers + PyTorch
  • Evaluation strategy: 10% test split (a minimal training sketch follows this list)
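
The sketch below illustrates fine-tuning under the configuration above. The data handling, optimizer, and learning rate are assumptions; the actual multilingual_translator.py script may differ.

import torch
from torch.utils.data import DataLoader
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# (source, target) pairs built from the aligned .txt files, e.g. via load_pairs() above
pairs = [("translate English to French: I am a student.", "Je suis un étudiant.")]

def collate(batch):
    sources, targets = zip(*batch)
    enc = tokenizer(list(sources), padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
    labels = tokenizer(list(targets), padding=True, truncation=True,
                       max_length=128, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    enc["labels"] = labels
    return enc

loader = DataLoader(pairs, batch_size=4, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

model.train()
for epoch in range(1):  # single epoch, as noted above
    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

model.save_pretrained("model")
tokenizer.save_pretrained("model")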

🔄 Quantization

Post-training quantization was performed by converting the weights to half precision (FP16) with .half(), reducing model size and improving inference speed.

# Load the full-precision model
model_fp32 = T5ForConditionalGeneration.from_pretrained("model")

# Convert the weights to half precision and save them, together with the tokenizer,
# so the quantized_model/ directory can be loaded directly as shown above
model_fp16 = model_fp32.half()
model_fp16.save_pretrained("quantized_model")
T5Tokenizer.from_pretrained("model").save_pretrained("quantized_model")

Model Size Comparison:

| Type             | Size (KB) |
|------------------|-----------|
| FP32 (Original)  | ~6,904    |
| FP16 (Quantized) | ~3,452    |
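
These numbers can be re-checked by inspecting the saved weight files. A small sketch, assuming the directory layout shown in the repository structure below:

import os

for name in ("model", "quantized_model"):
    weights = os.path.join(name, "model.safetensors")
    print(f"{name}: {os.path.getsize(weights) / 1024:,.0f} KB")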

πŸ“ Repository Structure

.
├── model/                       # Contains FP32 model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   └── ...
├── quantized_model/             # Contains FP16 quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   └── ...
├── README.md                    # Documentation
└── multilingual_translator.py   # Training and inference script

⚠️ Limitations

  • Trained on a small dataset for only one epoch, so it may not generalize well to all phrases or complex sentences.
  • Language coverage is limited to 5 predefined languages.
  • gTTS depends on Google's Text-to-Speech service and requires internet access.

🤝 Contributing

Feel free to submit issues or PRs to add more language pairs, extend training, or integrate a UI for real-time use.
