TIM-UNIGE Multilingual Aranese Model (WMT24)

This model was submitted to the WMT24 Shared Task on Translation into Low-Resource Languages of Spain. It is a multilingual translation model that translates from Spanish into Aranese and Occitan, fine-tuned from facebook/nllb-200-distilled-600M.

🧠 Model Description

  • Architecture: NLLB (600M distilled)
  • Fine-tuned with a multilingual multistage approach
  • Includes transfer from Occitan to improve Aranese translation
  • Supports Aranese and Occitan via the oci_Latn language tag
  • Optional special tokens <arn> / <oci> used in training to distinguish the targets

📊 Performance

Evaluated on FLORES+ test set:

Language    BLEU   ChrF   TER
Aranese     30.1   49.8   71.5
Aragonese   61.9   79.5   26.8
  • Spanish→Aranese outperforms the Apertium baseline by +1.3 BLEU.
  • Spanish→Aragonese outperforms the Apertium baseline by +0.8 BLEU.

๐Ÿ—‚๏ธ Training Data

  • Real parallel data: OPUS, PILAR (Occitan, Aranese)
  • Synthetic data:
    • BLOOMZ-generated Aranese sentences (~59k)
    • Forward translations and back-translations produced with Apertium
  • Final fine-tuning: FLORES+ dev set (997 segments)
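As a toy illustration of how real and synthetic pairs can be combined into one training set (a hedged sketch, not the authors' actual pipeline; the sentence pairs are dummy strings), one simple policy is source-side deduplication where real data takes precedence over synthetic data:

```python
# Hedged sketch: merging real and synthetic Spanish-Aranese pairs.
# Deduplicate on the Spanish source; real pairs override synthetic ones.
def merge_parallel_data(real_pairs, synthetic_pairs):
    merged = dict(synthetic_pairs)  # synthetic pairs first...
    merged.update(real_pairs)       # ...real pairs win on source collisions
    return sorted(merged.items())

real = [("hola", "adiu")]
synthetic = [("hola", "ola"), ("gracias", "mercés")]
print(merge_parallel_data(real, synthetic))
# [('gracias', 'mercés'), ('hola', 'adiu')]
```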

๐Ÿ› ๏ธ Multilingual Training Setup

We trained the model jointly on Spanish–Occitan and Spanish–Aranese, using either:

  • oci_Latn as the shared target language tag, or
  • a special prefix token (<arn> or <oci>) to distinguish the two targets
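The prefix-token option can be sketched as follows (a minimal illustration: the token strings <arn> and <oci> come from the description above, but the helper name and its interface are hypothetical):

```python
# Hedged sketch of the target-selector scheme: a special token ("<arn>" for
# Aranese, "<oci>" for Occitan) is prepended to the Spanish source, so a
# single model sharing the oci_Latn tag can be steered toward either variety.
TARGET_TOKENS = {"aranese": "<arn>", "occitan": "<oci>"}

def add_target_prefix(source: str, target: str) -> str:
    """Prepend the target-selector token to a Spanish source sentence."""
    return f"{TARGET_TOKENS[target]} {source}"

print(add_target_prefix("Buenos días.", "aranese"))
# <arn> Buenos días.
```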

🚀 Quick Example (Spanish → Aranese)

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load model and tokenizer; src_lang tells the NLLB tokenizer the input
# language is Spanish
model_name = "jonathanmutal/WMT24-spanish-to-aranese-aragonese"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="spa_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Input in Spanish
spanish_sentence = "¿Cómo se encuentra usted hoy?"

# Tokenize input
inputs = tokenizer(spanish_sentence, return_tensors="pt")

# Target language: Aranese uses the 'oci_Latn' tag in NLLB
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("oci_Latn"),
    max_length=50,
    num_beams=5
)

# Decode and print the translation
translation = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print(translation)

Example output:

Com se trape vos aué?

๐Ÿ” Intended Uses

  • Translate Spanish texts into Aranese or Occitan
  • Research in low-resource multilingual MT
  • Applications for language revitalization or public health communication

โš ๏ธ Limitations

  • Aranese corpora remain extremely small
  • Because the same oci_Latn tag covers both Occitan and Aranese, disambiguating the target may require the special prefix tokens
  • Orthographic inconsistency or dialect variation may affect quality

📚 Citation

@inproceedings{mutal2024timunige,
  title     = "{TIM-UNIGE}: Translation into Low-Resource Languages of Spain for {WMT24}",
  author    = "Mutal, Jonathan and Ormaechea, Lucía",
  booktitle = "Proceedings of the Ninth Conference on Machine Translation",
  year      = "2024",
  pages     = "862--870"
}

👥 Authors

Jonathan Mutal and Lucía Ormaechea
