Luo-Swahili Machine Translation Model (NLLB-based)
Model Details
- Model Name: nllb-luo-swa-mt-v1
- Base Model: Fine-tuned from facebook/nllb-200-distilled-600M
- Language Pair: Luo (luo) to Swahili (swa)
- Dataset: SalomonMetre13/luo_swa_arXiv_2501.11003
- Hugging Face Model ID: SalomonMetre13/nllb-luo-swa-mt-v1
- Preliminary translation results: View THIS PDF
Description
This model is fine-tuned for translating text from Luo to Swahili using the NLLB-200 model architecture. The fine-tuning process involves extending the tokenizer's vocabulary with custom language tokens and training the model on a specific dataset. The model is designed to handle the nuances of translating between these two languages effectively.
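As a rough illustration of the vocabulary-extension step, the sketch below adds <luo> and <swa> as special tokens and resizes the embedding matrix. The token names are inferred from the inference examples further down; the actual training script may have done this differently.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

base_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(base_name)
model = AutoModelForSeq2SeqLM.from_pretrained(base_name)

# Register the custom language tokens so the tokenizer never splits them
tokenizer.add_special_tokens({"additional_special_tokens": ["<luo>", "<swa>"]})

# Resize the embedding matrix so the new tokens get trainable embeddings
model.resize_token_embeddings(len(tokenizer))
```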
Features
- Custom Tokenizer: Extended with special tokens for Luo and Swahili to improve translation accuracy.
- Training: Fine-tuned on a curated dataset specifically designed for Luo-Swahili translation.
- Evaluation: Uses the BLEU score to assess translation quality on a held-out test set.
- Inference: Capable of translating new sentences and batches of text with efficient processing.
Usage
Installation
Ensure you have the necessary libraries installed:
```bash
pip install datasets transformers sacrebleu huggingface_hub accelerate torch
```
Fine-Tuning and Training
- Authentication: Log in to Hugging Face to access the model and dataset.
- Preprocessing: The dataset is preprocessed to include special language tokens.
- Training: The model is fine-tuned using the Seq2SeqTrainer with specified training arguments (a sketch follows this list).
- Evaluation: The model's performance is evaluated using the BLEU metric.
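The exact training script for this checkpoint is not reproduced here, but the sketch below shows roughly what the workflow looks like with Seq2SeqTrainer. The split and column names (train, luo, swa) and all hyperparameters are assumptions for illustration, not the values used to produce nllb-luo-swa-mt-v1.

```python
from datasets import load_dataset
from huggingface_hub import login
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

login()  # prompts for a Hugging Face access token

base_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(base_name)
model = AutoModelForSeq2SeqLM.from_pretrained(base_name)
# NOTE: the custom <luo>/<swa> tokens are assumed to have been added to the
# tokenizer and the embeddings resized (see the sketch in the Description section).

# Column names "luo" and "swa" are assumptions; check the dataset card for the real schema.
dataset = load_dataset("SalomonMetre13/luo_swa_arXiv_2501.11003")

def preprocess(batch):
    # Prepend the custom source-language token, matching the inference examples below
    sources = [f"<luo> {s.strip()}" for s in batch["luo"]]
    return tokenizer(sources, text_target=batch["swa"], max_length=128, truncation=True)

tokenized = dataset.map(preprocess, batched=True)

training_args = Seq2SeqTrainingArguments(
    output_dir="nllb-luo-swa-mt-v1",
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```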
Inference
Translate a Single Sentence
To translate a single sentence from Luo to Swahili, use the translate_custom_sentence function. Here's an example:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned model and tokenizer
model_name = "SalomonMetre13/nllb-luo-swa-mt-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate_custom_sentence(src_text: str, src_lang: str = "luo", tgt_lang: str = "swa") -> str:
    # Prepend the custom source-language token
    formatted_text = f"<{src_lang}> {src_text.strip()}"

    # Tokenize input
    inputs = tokenizer(
        formatted_text,
        return_tensors="pt",
        max_length=128,
        truncation=True
    ).to(model.device)

    # Generate translation, forcing the target-language token as the first decoder token
    outputs = model.generate(
        inputs.input_ids,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(f"<{tgt_lang}>"),
        max_length=150
    )

    # Decode and clean output
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
src_text = "Ang'o manoywai mibedo ja ngolo."
translation = translate_custom_sentence(src_text)
print(f"Luo: {src_text}")
print(f"Swahili: {translation}")
```
Translate a Batch of Sentences
For batch translation, use the translate_batch function. It processes multiple sentences at once, which can be more efficient for larger translation tasks.
```python
from tqdm.auto import tqdm

def translate_batch(src_texts: list, src_lang: str = "luo", tgt_lang: str = "swa") -> list:
    # Prepend the custom source-language token to every sentence
    formatted_texts = [f"<{src_lang}> {text.strip()}" for text in src_texts]
    inputs = tokenizer(
        formatted_texts,
        return_tensors="pt",
        max_length=128,
        truncation=True,
        padding=True,
        add_special_tokens=True
    ).to(model.device)
    outputs = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(f"<{tgt_lang}>"),
        max_length=150
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Example usage: translate in chunks of batch_size sentences
batch_size = 64
src_texts = [
    "Jajuok nomaki nyoro gotieno",
    "Bende ihero oduma moboki?",
    # Add more sentences as needed
]

translations = []
for i in tqdm(range(0, len(src_texts), batch_size)):
    translations.extend(translate_batch(src_texts[i:i + batch_size]))

for src, tgt in zip(src_texts, translations):
    print(f"Luo: {src}")
    print(f"Swahili: {tgt}")
```
Performance
The model is evaluated using the BLEU score on the test set. So far, the following evaluation result should be noted:
- Eval Loss: 0.322
This metric gives an indication of the translation quality and the model's ability to generalize to new data.
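For completeness, here is a minimal sketch of computing BLEU with sacrebleu (already listed in the install command above). The test split and the luo/swa column names are assumptions, and translate_batch is the helper from the inference section.

```python
import sacrebleu
from datasets import load_dataset

# Assumed split and column names; check the dataset card for the actual schema.
test = load_dataset("SalomonMetre13/luo_swa_arXiv_2501.11003", split="test")

# For large test sets, translate in chunks as shown in the batch example above.
hypotheses = translate_batch([ex["luo"] for ex in test])
references = [[ex["swa"] for ex in test]]  # sacrebleu expects a list of reference streams

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```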
Limitations
- The model is trained with a maximum input length of 512 tokens, which is sufficient for typical sentences but may limit its effectiveness on longer texts.
- The dataset used for fine-tuning may influence the model's performance on specific domains or styles of text.
Future Work
- Explore fine-tuning on additional datasets to improve robustness.
- Experiment with different training parameters and architectures to enhance performance.
Contact
For questions or feedback, please contact [[email protected]]