---
library_name: transformers
license: mit
datasets:
- sawadogosalif/MooreFRCollections
metrics:
- bleu
- loss
---

# MooreFR-SaChi Translation Model

⚠️ **WARNING**: This model is intended as a template for researchers and developers interested in advancing work on languages that are not well represented in large language models. For more comprehensive approaches, please consult the work of David Dale ([daviddale.ru](https://daviddale.ru/)) or explore [my GitHub repository](https://github.com/sawadogosalif/SaChi/) for further insights and methodologies.

## Model Details

This model is a fine-tuned version of `nllb-200-distilled-600M` specialized in French–Moore (Mossi) translation. It has been trained to translate between French (`fr_Latn`) and Moore (`moor_Latn`), with particularly strong performance in the French-to-Moore direction.

- **Base Model**: facebook/nllb-200-distilled-600M
- **Languages**: French (`fr_Latn`) ↔ Moore (`moor_Latn`)
- **Training Dataset**: [MooreFRCollections](https://huggingface.co/datasets/sawadogosalif/MooreFRCollections)
- **Performance**:
  - BLEU score: 39.1 (direction: `fra_Latn → moor_Latn`; needs improvement)
  - Loss: 1.01 on a validation set of 1,000 examples (needs improvement)
- **Training Time**: 2 hours on a T4 GPU (3 epochs)

## Usage

Here's how to use this model for translation:

```python
from transformers import NllbTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
MODEL_URL = "sawadogosalif/MooreFR-SaChi-translationv0"
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_URL)
tokenizer = NllbTokenizer.from_pretrained(MODEL_URL)

# Fix tokenizer for Moore language
def fix_tokenizer(tokenizer, new_lang):
    """
    Adds a new language token to the tokenizer and updates ID mappings.
    - Adds the special token if it doesn't already exist
    - Initializes or updates `lang_code_to_id` and `id_to_lang_code`
      using `getattr` to avoid repeated checks
    """
    if new_lang not in tokenizer.additional_special_tokens:
        tokenizer.add_special_tokens({'additional_special_tokens': [new_lang]})

    tokenizer.lang_code_to_id = getattr(tokenizer, 'lang_code_to_id', {})
    tokenizer.id_to_lang_code = getattr(tokenizer, 'id_to_lang_code', {})

    if new_lang not in tokenizer.lang_code_to_id:
        new_lang_id = tokenizer.convert_tokens_to_ids(new_lang)
        tokenizer.lang_code_to_id[new_lang] = new_lang_id
        tokenizer.id_to_lang_code[new_lang_id] = new_lang

    return tokenizer

# Initialize tokenizer with Moore language
fix_tokenizer(tokenizer, 'moor_Latn')

# Translation function
def translate(text, src_lang='fr_Latn', tgt_lang='moor_Latn',
              a=32, b=3, max_input_length=1024, num_beams=4, **kwargs):
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(text, return_tensors='pt', padding=True,
                       truncation=True, max_length=max_input_length)
    result = model.generate(
        **inputs.to(model.device),
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        # Output budget grows linearly with input length: a + b * n_input_tokens
        max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
        num_beams=num_beams,
        **kwargs
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)

# Example usage
french_text = "Je suis né à Ouagadougou. J'ai demenagé à Banfora pour mes etudes"
moore_translation = translate(french_text, 'fr_Latn', 'moor_Latn')
print(moore_translation)
# Expected output: ['Mam doga Ouadagoou. Mam kẽnga Banfora m sẽn na yɩl n tɩ karem be.']
```
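The same helper can also be called in the reverse direction (Moore to French), which the model supports but with weaker performance. Below is a minimal sketch, assuming `fr_Latn` is the French code used by this checkpoint (as in the examples above); the Moore input is simply the expected output from the previous example, and the variable names are illustrative:

```python
# Reverse direction (Moore -> French), reusing the `translate` helper defined above.
# The input sentence is taken from the expected output of the previous example;
# the exact French output is illustrative and not guaranteed.
moore_text = "Mam doga Ouadagoou. Mam kẽnga Banfora m sẽn na yɩl n tɩ karem be."
french_back_translation = translate(moore_text, src_lang='moor_Latn', tgt_lang='fr_Latn')
print(french_back_translation)
```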
### Alternative Translation Function

For more flexibility, you can use this enhanced translation function:

```python
def translate_v2(text, model, tokenizer,
                 src_lang='fr_Latn', tgt_lang='moor_Latn',
                 max_length='auto', num_beams=4, no_repeat_ngram_size=4,
                 n_out=None, **kwargs):
    tokenizer.src_lang = src_lang
    encoded = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    if max_length == 'auto':
        # Output budget grows linearly with input length
        max_length = int(32 + 2.0 * encoded.input_ids.shape[1])
    model.eval()
    generated_tokens = model.generate(
        **encoded.to(model.device),
        # Requires the target language to be registered in `lang_code_to_id`
        # (see fix_tokenizer above)
        forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang],
        max_length=max_length,
        num_beams=num_beams,
        no_repeat_ngram_size=no_repeat_ngram_size,
        num_return_sequences=n_out or 1,
        **kwargs
    )
    out = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
    # Return a single string for single-string input, a list otherwise
    if isinstance(text, str) and n_out is None:
        return out[0]
    return out
```

## Training

This model was trained using the [SaChi training framework](https://github.com/sawadogosalif/SaChi) with the following parameters:

```yaml
model:
  name: "facebook/nllb-200-distilled-600M"
  save_path: "./models/nllb-moore-finetuned"
  new_lang_code: "moore_open"

training:
  batch_size: 16
  num_epochs: 3
  learning_rate: 1e-4
  warmup_steps: 1000
  max_length: 128
  accumulation_steps: 1
  eval_steps: 1000
  save_steps: 5000
  early_stopping_patience: 5
  fp16: true
  resume_from: null
  max_grad_norm: 1.0

data:
  dataset_name: "sawadogosalif/MooreFRCollections"
  train_size: 0.8
  test_size: 0.1
  val_size: 0.1
  random_seed: 2025
  src_col: "source"
  tgt_col: "target"
  src_lang_col: "french"
  tgt_lang_col: "moore"

evaluation:
  num_samples: 10
  num_beams: 5
  no_repeat_ngram_size: 3
```

The training was completed in approximately 2 hours on a T4 GPU for 3 epochs.

## Dataset

This model was trained on the [MooreFRCollections](https://huggingface.co/datasets/sawadogosalif/MooreFRCollections) dataset, which contains parallel French–Moore texts.

## Limitations

- The model performs best with standard French input text.
- Performance may vary with highly technical, specialized, or colloquial language.
- The model may not handle certain Moore dialectal variations perfectly.

## Source Code

The training code is available in the [SaChi repository](https://github.com/sawadogosalif/SaChi).

## Citation

```
@misc{sawadogo2025moorefrsachi,
  author       = {Sawadogo, Salif},
  title        = {MooreFR-SaChi-translationv0},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/sawadogosalif/MooreFR-SaChi-translationv0}}
}
```