Kurdish-English Machine Translation with Transformers
This repository focuses on fine-tuning a Kurdish-English machine translation model using Hugging Face's Transformers library with MarianMT.
The model is trained on a custom parallel corpus with a detailed pipeline that includes data preprocessing, bidirectional training, evaluation, and inference.
This model is a product of the AI Center of Kurdistan University.
Introduction
This project fine-tunes a MarianMT model for Kurdish-English translation on a custom parallel corpus. Training is configured bidirectionally, so the same model can translate in both language directions.
Requirements
- Python 3.8+
- Hugging Face Transformers
- Datasets library
- SentencePiece
- PyTorch 1.9+
- CUDA (for GPU support)
Setup
- Clone the repository and install dependencies.
- Ensure GPU availability.
- Prepare your Kurdish-English corpus in CSV format.
Pipeline Overview
Data Preparation
- Corpus: A Kurdish-English parallel corpus in CSV format with columns Source (Kurdish) and Target (English).
- Path Definition: Specify the corpus path in the configuration.
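As a minimal sketch of the corpus format described above, the following reads a CSV with Source and Target columns into sentence pairs using only the standard library. The function name load_parallel_corpus and the sample rows are illustrative assumptions, not part of the repository, and rows with an empty side are skipped as a guessed cleaning step.

```python
import csv
import io
from typing import List, Tuple


def load_parallel_corpus(csv_file) -> List[Tuple[str, str]]:
    """Read (Kurdish, English) pairs from an open CSV file object.

    Expects a header row with 'Source' (Kurdish) and 'Target' (English).
    Rows where either side is empty are dropped (an assumption, not a
    documented step of the original pipeline).
    """
    pairs = []
    for row in csv.DictReader(csv_file):
        source = (row.get("Source") or "").strip()
        target = (row.get("Target") or "").strip()
        if source and target:
            pairs.append((source, target))
    return pairs


# Illustrative in-memory sample; real data would come from the corpus file.
SAMPLE = "Source,Target\nSilav,Hello\n,Missing source\nSpas,Thank you\n"
pairs = load_parallel_corpus(io.StringIO(SAMPLE))
print(pairs)
```

For a file on disk, open it with open(path, newline="", encoding="utf-8") and pass the handle to the same function.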
Training SentencePiece Tokenizer
- Vocabulary Size: 32,000
- Source Data: The tokenizer is trained on both the primary Kurdish corpus and the English dataset to create shared subword tokens.
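The tokenizer step above could look roughly like the sketch below, which pools Kurdish and English sentences into one training file so that both languages share a subword vocabulary. Only the vocabulary size of 32,000 comes from this README; the helper names, the model prefix, character_coverage=1.0, and the unigram model type are assumptions. SentencePiece is imported inside the training function so the text-preparation helper can be used without it installed.

```python
from itertools import chain
from pathlib import Path
from typing import Iterable, List


def build_training_lines(kurdish: Iterable[str], english: Iterable[str]) -> List[str]:
    """Pool both languages into one list of whitespace-normalised sentences."""
    lines = []
    for sentence in chain(kurdish, english):
        cleaned = " ".join(sentence.split())
        if cleaned:  # skip blank lines
            lines.append(cleaned)
    return lines


def train_tokenizer(lines: List[str],
                    model_prefix: str = "kurdish_english_sp",  # hypothetical name
                    vocab_size: int = 32000) -> str:
    """Train a shared SentencePiece model and return the .model file path."""
    import sentencepiece as spm  # lazy import: only needed for training

    text_path = Path(f"{model_prefix}.txt")
    text_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    spm.SentencePieceTrainer.train(
        input=str(text_path),
        model_prefix=model_prefix,
        vocab_size=vocab_size,
        character_coverage=1.0,  # assumption: keep every character
        model_type="unigram",    # assumption: SentencePiece default
    )
    return f"{model_prefix}.model"


demo_lines = build_training_lines(["Silav  cîhan", "   "], ["hello world"])
print(demo_lines)
```

Note that a 32,000-piece vocabulary needs a reasonably large corpus; SentencePiece refuses to train if the requested vocabulary is larger than the data can support.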
Model and Tokenizer Setup
- Model: Helsinki-NLP/opus-mt-en-mul, a pre-trained MarianMT model.
- Tokenizer: MarianMT tokenizer aligned with the model, with source and target languages set dynamically.
Tokenization and Dataset Preparation
- Train-Validation Split: 90% train, 10% validation split.
- Maximum Sequence Length: 128 tokens for both source and target sequences.
- Bidirectional Tokenization: Prepare tokenized sequences for both Kurdish-English and English-Kurdish translation.
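One plausible way to realise the split and the bidirectional examples is sketched below. The 90/10 ratio and the 128-token limit come from the list above; the function names, the fixed seed, and the choice to split before mirroring the pairs (so a sentence never appears in training in one direction and in validation in the other) are assumptions. tokenize_example accepts any Hugging Face tokenizer that supports the text_target argument.

```python
import random
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (Kurdish, English)


def train_validation_split(pairs: List[Pair],
                           validation_fraction: float = 0.1,
                           seed: int = 42) -> Tuple[List[Pair], List[Pair]]:
    """Shuffle deterministically and hold out a fraction for validation."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def make_bidirectional(pairs: List[Pair]) -> List[Dict[str, str]]:
    """Emit each pair twice: Kurdish->English and English->Kurdish."""
    examples = []
    for kurdish, english in pairs:
        examples.append({"src": kurdish, "tgt": english, "direction": "ku-en"})
        examples.append({"src": english, "tgt": kurdish, "direction": "en-ku"})
    return examples


def tokenize_example(tokenizer, example: Dict[str, str], max_length: int = 128):
    """Tokenize source and target, truncating both to max_length tokens."""
    return tokenizer(
        example["src"],
        text_target=example["tgt"],
        max_length=max_length,
        truncation=True,
    )


demo_pairs = [(f"ku-{i}", f"en-{i}") for i in range(20)]
train_pairs, val_pairs = train_validation_split(demo_pairs)
train_examples = make_bidirectional(train_pairs)
print(len(train_pairs), len(val_pairs), len(train_examples))
```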
Training Configuration
- Learning Rate: 2e-5
- Batch Size: 4 (per device, for both training and evaluation)
- Weight Decay: 0.01
- Evaluation Strategy: Per epoch
- Epochs: 3
- Logging: Logs saved every 100 steps, with TensorBoard logging enabled
- Output Directory: ./results
- Device: GPU 1 explicitly set
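The settings above map onto Hugging Face training arguments roughly as follows. This is a sketch, not the repository's actual script: it assumes Seq2SeqTrainingArguments is the argument class, that "GPU 1" means CUDA device index 1, and that TensorBoard is selected through report_to. Recent Transformers releases name the evaluation setting eval_strategy while older ones use evaluation_strategy, so the helper checks which field exists.

```python
import dataclasses
import os

# Assumption: "GPU 1" is CUDA device index 1. This has to be set before
# torch or transformers initialises CUDA for it to take effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

TRAINING_CONFIG = {
    "output_dir": "./results",
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 4,
    "per_device_eval_batch_size": 4,
    "weight_decay": 0.01,
    "num_train_epochs": 3,
    "eval_strategy": "epoch",      # evaluate once per epoch
    "logging_steps": 100,          # log every 100 steps
    "report_to": ["tensorboard"],  # enable TensorBoard logging
}


def build_training_args(config=None):
    """Turn the config dict into Seq2SeqTrainingArguments."""
    from transformers import Seq2SeqTrainingArguments  # lazy: heavy import

    settings = dict(TRAINING_CONFIG if config is None else config)
    field_names = {f.name for f in dataclasses.fields(Seq2SeqTrainingArguments)}
    if "eval_strategy" not in field_names:
        # Older Transformers releases use the longer field name.
        settings["evaluation_strategy"] = settings.pop("eval_strategy")
    return Seq2SeqTrainingArguments(**settings)
```

The resulting object would then be passed to a Seq2SeqTrainer together with the model, tokenizer, and tokenized datasets.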
Evaluation and Metrics
The following metrics are computed on the validation dataset:
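- BLEU: Measures translation quality as modified n-gram precision against the reference, with a brevity penalty for overly short output.
- METEOR: Aligns words using exact, stem, and synonym matches, combining precision and recall.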
- BERTScore: Evaluates semantic similarity with BERT embeddings.
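To make the BLEU definition concrete, here is a simplified corpus-level BLEU in plain Python: clipped n-gram precision up to 4-grams, a geometric mean, and a brevity penalty. It is an illustration only, assuming whitespace tokenization, one reference per hypothesis, and no smoothing; it is not the scoring code used in this project, where a library such as sacreBLEU would be the usual choice.

```python
import math
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[str], references: List[str], max_n: int = 4) -> float:
    """Unsmoothed corpus BLEU in [0, 1] with a single reference each."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ng, r_ng = ngram_counts(h, n), ngram_counts(r, n)
            # Counter intersection clips each n-gram to its reference count.
            matches[n - 1] += sum((h_ng & r_ng).values())
            totals[n - 1] += sum(h_ng.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


score = corpus_bleu(["the cat sat on"], ["the cat sat on the mat"])
print(round(score, 4))
```

In the usage line every n-gram of the hypothesis appears in the reference, so the score is determined entirely by the brevity penalty for the missing two words.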
Inference
Inference includes bidirectional translation capabilities:
- English to Kurdish translation (the base model's native direction).
- Kurdish to English translation (the reverse direction).
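A minimal inference sketch under a few assumptions: the model is loaded from the ./fine-tuned-marianmt directory named in the Results section, generation is capped at 128 new tokens to mirror the training sequence length, and the direction is selected with an optional target-language tag prefixed to the input, as multilingual MarianMT checkpoints expect. The function names and the exact tag values are hypothetical; how this project switches direction is not documented here.

```python
from typing import List, Optional


def translate(texts: List[str], model, tokenizer,
              target_tag: Optional[str] = None,
              max_length: int = 128) -> List[str]:
    """Translate a batch of sentences with a MarianMT-style model."""
    if target_tag:
        # Multilingual Marian models read the target language from a
        # '>>xxx<<' token at the start of the source text.
        texts = [f"{target_tag} {text}" for text in texts]
    batch = tokenizer(
        texts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_length,
    ).to(model.device)
    generated = model.generate(**batch, max_new_tokens=max_length)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


def load_fine_tuned(path: str = "./fine-tuned-marianmt"):
    """Load the saved model and tokenizer from disk."""
    from transformers import MarianMTModel, MarianTokenizer  # lazy: heavy import

    tokenizer = MarianTokenizer.from_pretrained(path)
    model = MarianMTModel.from_pretrained(path)
    return model, tokenizer
```

Typical use would be model, tokenizer = load_fine_tuned() followed by translate(["Hello"], model, tokenizer, target_tag=...), where the tag should be one of the values listed in tokenizer.supported_language_codes.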
Results
The fine-tuned model and tokenizer are saved to ./fine-tuned-marianmt, together with the BLEU, METEOR, and BERTScore evaluation metrics.