---
library_name: transformers
license: mit
datasets:
- sawadogosalif/MooreFRCollections
metrics:
- bleu
- loss
---

# MooreFR-SaChi Translation Model

⚠️ **WARNING**: This model is intended as a template for researchers and developers interested in advancing work on languages that are not well represented in large language models. For more comprehensive approaches, please consult the work of David Dale ([daviddale.ru](https://daviddale.ru/)) or explore [my GitHub repository](https://github.com/sawadogosalif/SaChi/) for further insights and methodologies.

## Model Details

This model is a fine-tuned version of `nllb-200-distilled-600M` specialized in French–Moore (Mossi) translation. It has been trained to translate between French (`fr_Latn`) and Moore (`moor_Latn`), with particularly strong performance in the French-to-Moore direction.

- **Base Model**: facebook/nllb-200-distilled-600M
- **Languages**: French (`fr_Latn`) ↔ Moore (`moor_Latn`)
- **Training Dataset**: [MooreFRCollections](https://huggingface.co/datasets/sawadogosalif/MooreFRCollections)
- **Performance**:
  - BLEU score: 39.1 (direction: `fra_Latn → moor_Latn`; needs improvement)
  - Loss: 1.01 on a validation set of 1,000 examples (needs improvement)
- **Training Time**: 2 hours on a T4 GPU (3 epochs)

## Usage

Here's how to use this model for translation:

```python
from transformers import NllbTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
MODEL_URL = "sawadogosalif/MooreFR-SaChi-translationv0"
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_URL)
tokenizer = NllbTokenizer.from_pretrained(MODEL_URL)

# Fix tokenizer for Moore language
def fix_tokenizer(tokenizer, new_lang):
    """
    Adds a new language token to the tokenizer and updates ID mappings.
    - Adds the special token if it doesn't already exist
    - Initializes or updates `lang_code_to_id` and `id_to_lang_code`
      using `getattr` to avoid repeated checks
    """
    if new_lang not in tokenizer.additional_special_tokens:
        tokenizer.add_special_tokens({'additional_special_tokens': [new_lang]})

    tokenizer.lang_code_to_id = getattr(tokenizer, 'lang_code_to_id', {})
    tokenizer.id_to_lang_code = getattr(tokenizer, 'id_to_lang_code', {})

    if new_lang not in tokenizer.lang_code_to_id:
        new_lang_id = tokenizer.convert_tokens_to_ids(new_lang)
        tokenizer.lang_code_to_id[new_lang] = new_lang_id
        tokenizer.id_to_lang_code[new_lang_id] = new_lang

    return tokenizer

# Initialize tokenizer with Moore language
fix_tokenizer(tokenizer, 'moor_Latn')

# Translation function
def translate(text, src_lang='fr_Latn', tgt_lang='moor_Latn',
              a=32, b=3, max_input_length=1024, num_beams=4, **kwargs):
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(text, return_tensors='pt', padding=True,
                       truncation=True, max_length=max_input_length)
    result = model.generate(
        **inputs.to(model.device),
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        # Output budget grows linearly with input length: a + b * n_input_tokens
        max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
        num_beams=num_beams,
        **kwargs
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)

# Example usage
french_text = "Je suis né à Ouagadougou. J'ai demenagé à Banfora pour mes etudes"
moore_translation = translate(french_text, 'fr_Latn', 'moor_Latn')
print(moore_translation)
# Expected output: ['Mam doga Ouadagoou. Mam kẽnga Banfora m sẽn na yɩl n tɩ karem be.']
```
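The same helper can also be called in the reverse direction (Moore to French), which the model supports but with weaker performance. Below is a minimal sketch, assuming `fr_Latn` is the French code used by this checkpoint (as in the examples above); the Moore input is simply the expected output from the previous example, and the variable names are illustrative:

```python
# Reverse direction (Moore -> French), reusing the `translate` helper defined above.
# The input sentence is taken from the expected output of the previous example;
# the exact French output is illustrative and not guaranteed.
moore_text = "Mam doga Ouadagoou. Mam kẽnga Banfora m sẽn na yɩl n tɩ karem be."
french_back_translation = translate(moore_text, src_lang='moor_Latn', tgt_lang='fr_Latn')
print(french_back_translation)
```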
### Alternative Translation Function

For more flexibility, you can use this enhanced translation function:

```python
def translate_v2(text, model, tokenizer,
                 src_lang='fr_Latn', tgt_lang='moor_Latn',
                 max_length='auto', num_beams=4, no_repeat_ngram_size=4,
                 n_out=None, **kwargs):
    tokenizer.src_lang = src_lang
    encoded = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    if max_length == 'auto':
        # Output budget grows linearly with input length
        max_length = int(32 + 2.0 * encoded.input_ids.shape[1])
    model.eval()
    generated_tokens = model.generate(
        **encoded.to(model.device),
        # Requires the target language to be registered in `lang_code_to_id`
        # (see fix_tokenizer above)
        forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang],
        max_length=max_length,
        num_beams=num_beams,
        no_repeat_ngram_size=no_repeat_ngram_size,
        num_return_sequences=n_out or 1,
        **kwargs
    )
    out = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
    # Return a single string for single-string input, a list otherwise
    if isinstance(text, str) and n_out is None:
        return out[0]
    return out
```

## Training

This model was trained using the [SaChi training framework](https://github.com/sawadogosalif/SaChi) with the following parameters:

```yaml
model:
  name: "facebook/nllb-200-distilled-600M"
  save_path: "./models/nllb-moore-finetuned"
  new_lang_code: "moore_open"

training:
  batch_size: 16
  num_epochs: 3
  learning_rate: 1e-4
  warmup_steps: 1000
  max_length: 128
  accumulation_steps: 1
  eval_steps: 1000
  save_steps: 5000
  early_stopping_patience: 5
  fp16: true
  resume_from: null
  max_grad_norm: 1.0

data:
  dataset_name: "sawadogosalif/MooreFRCollections"
  train_size: 0.8
  test_size: 0.1
  val_size: 0.1
  random_seed: 2025
  src_col: "source"
  tgt_col: "target"
  src_lang_col: "french"
  tgt_lang_col: "moore"

evaluation:
  num_samples: 10
  num_beams: 5
  no_repeat_ngram_size: 3
```

The training was completed in approximately 2 hours on a T4 GPU for 3 epochs.

## Dataset

This model was trained on the [MooreFRCollections](https://huggingface.co/datasets/sawadogosalif/MooreFRCollections) dataset, which contains parallel French–Moore texts.

## Limitations

- The model performs best with standard French input text.
- Performance may vary with highly technical, specialized, or colloquial language.
- The model may not handle certain Moore dialectal variations perfectly.

## Source Code

The training code is available in the [SaChi repository](https://github.com/sawadogosalif/SaChi).

## Citation

```
@misc{sawadogo2025moorefrsachi,
  author       = {Sawadogo, Salif},
  title        = {MooreFR-SaChi-translationv0},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/sawadogosalif/MooreFR-SaChi-translationv0}}
}
```