Luo-Swahili Machine Translation Model (NLLB-based)

Model Details

  • Model Name: nllb-luo-swa-mt-v1
  • Base Model: Fine-tuned from facebook/nllb-200-distilled-600M
  • Language Pair: Luo (luo) to Swahili (swa)
  • Dataset: SalomonMetre13/luo_swa_arXiv_2501.11003
  • Hugging Face Model ID: SalomonMetre13/nllb-luo-swa-mt-v1
  • Preliminary translation results: see the linked PDF

Description

This model is fine-tuned for translating text from Luo to Swahili using the NLLB-200 model architecture. The fine-tuning process involves extending the tokenizer's vocabulary with custom language tokens and training the model on a specific dataset. The model is designed to handle the nuances of translating between these two languages effectively.

Features

  • Custom Tokenizer: Extended with special tokens for Luo and Swahili to improve translation accuracy (a short sketch follows this list).
  • Training: Fine-tuned on a curated dataset specifically designed for Luo-Swahili translation.
  • Evaluation: Translation quality is measured with the BLEU score on a held-out test set (full results pending; see Performance below).
  • Inference: Capable of translating new sentences and batches of text with efficient processing.
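
As an illustration of the custom-tokenizer feature, the snippet below shows one common way to register <luo> and <swa> as additional special tokens on the base NLLB tokenizer and to resize the model's embeddings accordingly. This is a sketch of the general technique; the exact procedure used for this checkpoint may differ.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Start from the distilled NLLB-200 base checkpoint.
base_model = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSeq2SeqLM.from_pretrained(base_model)

# Register the custom language tokens used for Luo and Swahili.
tokenizer.add_special_tokens({"additional_special_tokens": ["<luo>", "<swa>"]})

# Give the new tokens trainable embedding rows.
model.resize_token_embeddings(len(tokenizer))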

Usage

Installation

Ensure you have the necessary libraries installed:

pip install datasets transformers sacrebleu huggingface_hub accelerate torch
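
If you also plan to run the fine-tuning workflow below, authenticate with the Hugging Face Hub first (step 1 of the next section). One way to do this from Python, using the huggingface_hub package installed above:

from huggingface_hub import login

# Prompts for a Hugging Face access token; alternatively pass the token
# directly, e.g. login(token="hf_...").
login()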

Fine-Tuning and Training

  1. Authentication: Log in to Hugging Face to access the model and dataset.
  2. Preprocessing: The dataset is preprocessed to include special language tokens.
  3. Training: The model is fine-tuned using the Seq2SeqTrainer with specified training arguments (an illustrative sketch follows this list).
  4. Evaluation: The model's performance is evaluated using the BLEU metric.
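
The exact training script and hyperparameters for this checkpoint are not reproduced here. The following is a minimal, illustrative sketch of steps 2-4 with Seq2SeqTrainer; the dataset column names ("src" and "tgt") and all training arguments shown are assumptions, not the settings used to produce this model.

from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

base_model = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSeq2SeqLM.from_pretrained(base_model)

# Step 2: extend the tokenizer with the custom language tokens and resize
# the embedding matrix (see the Features sketch above).
tokenizer.add_special_tokens({"additional_special_tokens": ["<luo>", "<swa>"]})
model.resize_token_embeddings(len(tokenizer))

dataset = load_dataset("SalomonMetre13/luo_swa_arXiv_2501.11003")

def preprocess(batch):
    # Assumed column names: "src" holds Luo text, "tgt" holds Swahili text.
    inputs = [f"<luo> {s.strip()}" for s in batch["src"]]
    # Prefixing targets with <swa> mirrors the forced_bos_token_id used at
    # inference time; the actual recipe may handle this differently.
    targets = [f"<swa> {t.strip()}" for t in batch["tgt"]]
    enc = tokenizer(inputs, max_length=512, truncation=True)
    enc["labels"] = tokenizer(text_target=targets, max_length=512,
                              truncation=True)["input_ids"]
    return enc

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

# Step 3: fine-tune with Seq2SeqTrainer (illustrative arguments only).
training_args = Seq2SeqTrainingArguments(
    output_dir="nllb-luo-swa-mt-v1",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    predict_with_generate=True,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized.get("validation"),
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()

# Step 4: evaluation with BLEU is shown in the Performance section below.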

Inference

Translate a Single Sentence

To translate a single sentence from Luo to Swahili, use the translate_custom_sentence function. Here's an example:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned model and tokenizer
model_name = "SalomonMetre13/nllb-luo-swa-mt-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate_custom_sentence(src_text: str, src_lang: str = "luo", tgt_lang: str = "swa") -> str:
    formatted_text = f"<{src_lang}> {src_text.strip()}"

    # Tokenize input
    inputs = tokenizer(
        formatted_text,
        return_tensors="pt",
        max_length=128,
        truncation=True
    ).to(model.device)

    # Generate translation
    outputs = model.generate(
        inputs.input_ids,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(f"<{tgt_lang}>"),
        max_length=150
    )

    # Decode and clean output
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
src_text = "Ang'o manoywai mibedo ja ngolo."
translation = translate_custom_sentence(src_text)
print(f"Luo: {src_text}")
print(f"Swahili: {translation}")

Translate a Batch of Sentences

For batch translation, use the translate_batch function. This function processes multiple sentences at once, which can be more efficient for larger translation tasks.

from tqdm.auto import tqdm

def translate_batch(src_texts: list, src_lang: str = "luo", tgt_lang: str = "swa") -> list:
    # Prefix every sentence with the source-language token.
    formatted_texts = [f"<{src_lang}> {text.strip()}" for text in src_texts]

    inputs = tokenizer(
        formatted_texts,
        return_tensors="pt",
        max_length=128,
        truncation=True,
        padding=True,
        add_special_tokens=True
    ).to(model.device)

    outputs = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(f"<{tgt_lang}>"),
        max_length=150
    )

    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Example usage: translate in chunks of batch_size sentences with a progress bar
batch_size = 64
src_texts = [
    "Jajuok nomaki nyoro gotieno",
    "Bende ihero oduma moboki?",
    # Add more sentences as needed
]
translations = []
for i in tqdm(range(0, len(src_texts), batch_size)):
    translations.extend(translate_batch(src_texts[i:i + batch_size]))
for src, tgt in zip(src_texts, translations):
    print(f"Luo: {src}")
    print(f"Swahili: {tgt}")

Performance

The model's performance will be evaluated with the BLEU score on the test set; a full BLEU figure is not yet reported. So far, the following evaluation result is available:

  • Eval Loss: 0.322

This metric provides an indication of translation quality and of the model's ability to generalize to new data.
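
For reference, BLEU can be computed with sacrebleu (already listed in the installation step) once a test set is available. The sketch below reuses the translate_batch function from the Inference section; the test sentences and references are placeholders, not actual evaluation data.

import sacrebleu

# Placeholder held-out data: Luo sources and their Swahili references.
test_src = ["Jajuok nomaki nyoro gotieno"]
test_refs = ["..."]  # gold Swahili translations, one per source sentence

# Translate with the batch function from the Inference section.
hypotheses = translate_batch(test_src)

# sacrebleu expects a list of reference streams; with a single reference
# per sentence, wrap the reference list once.
bleu = sacrebleu.corpus_bleu(hypotheses, [test_refs])
print(f"BLEU: {bleu.score:.2f}")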

Limitations

  • The model is trained with a maximum input length of 512 tokens, which is sufficient for most everyday sentences but may limit its effectiveness on longer texts.
  • The dataset used for fine-tuning may influence the model's performance on specific domains or styles of text.

Future Work

  • Explore fine-tuning on additional datasets to improve robustness.
  • Experiment with different training parameters and architectures to enhance performance.

Contact

For questions or feedback, please contact [[email protected]]
