---
license: mit
---

# PaLe-MADLAD

This is the MADLAD-400 model fine-tuned to translate from Proper Karelian, Livvi, Ludian, and Veps to Russian and vice versa. We call the model **Pa**ragraph-**Le**vel because we trained it on paragraphs comprising multiple sentences. The model demonstrates the ability to handle gender-neutral pronouns (a major obstacle when translating from Finno-Ugric languages) and other discourse-level phenomena.

## Example Usage for Inference

````
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
model = AutoModelForSeq2SeqLM.from_pretrained('tartuNLP/pale-madlad-mt')
tokenizer = AutoTokenizer.from_pretrained('tartuNLP/pale-madlad-mt')

# You need to explicitly prepend a target language tag to the input string in the format <2xx>, where xx stands for the language code.
# Language codes: 'krl' for Proper Karelian, 'lud' for Ludian, 'olo' for Livvi, 'vep' for Veps, 'ru' for Russian, 'en' for English.
text = '<2krl>' + 'Здравствуйте!'

# Tokenize the tagged input, generate the translation, and decode it back to text.
inputs = tokenizer(text, return_tensors='pt').input_ids
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Output: Terveh!
````
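
The tagging convention shown above can be wrapped in a small helper for translating several paragraphs at once. The following is a minimal sketch, not part of the released code: `add_target_tag` and `translate` are hypothetical names, and `max_new_tokens=256` is an assumed limit, chosen because the default generation length may be too short for multi-sentence paragraphs.

```python
# Language codes listed in this model card.
LANGUAGE_CODES = {
    'krl': 'Proper Karelian',
    'lud': 'Ludian',
    'olo': 'Livvi',
    'vep': 'Veps',
    'ru': 'Russian',
    'en': 'English',
}


def add_target_tag(text, target_lang):
    """Prepend the <2xx> target-language tag (hypothetical helper)."""
    if target_lang not in LANGUAGE_CODES:
        raise ValueError(f'Unknown language code: {target_lang!r}')
    return f'<2{target_lang}>{text}'


def translate(paragraphs, target_lang, model, tokenizer, max_new_tokens=256):
    """Translate a list of paragraphs in one batch (hypothetical helper).

    max_new_tokens=256 is an assumed value, not a setting from the authors.
    """
    tagged = [add_target_tag(p, target_lang) for p in paragraphs]
    # Pad so that paragraphs of different lengths fit into one batch.
    inputs = tokenizer(tagged, return_tensors='pt', padding=True)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


# Usage, with `model` and `tokenizer` loaded as in the example above:
# translate(['Здравствуйте!'], 'krl', model, tokenizer)
```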

Please cite the following paper if you use this model in your work:

```
@inproceedings{pashchenko2024paragraphlevel,
  title={Paragraph-Level Machine Translation for Low-Resource Finno-Ugric Languages},
  author={Dmytro Pashchenko and Lisa Yankovskaya and Mark Fishel},
  booktitle={The Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies},
  year={2024},
  url={https://openreview.net/forum?id=uTFJsQpNZk}
}
```