---
license: mit
---
# PaLe-MADLAD

This is the MADLAD-400 model fine-tuned to translate from Proper Karelian, Livvi, Ludian, and Veps to Russian and vice versa. We call the model **Pa**ragraph-**Le**vel because it was trained on paragraphs comprising multiple sentences. It is able to handle gender-neutral pronouns (a major obstacle when translating from Finno-Ugric languages) and other discourse-level phenomena.

## Example Usage for Inference

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained('tartuNLP/pale-madlad-mt')
tokenizer = AutoTokenizer.from_pretrained('tartuNLP/pale-madlad-mt')

# Explicitly prepend a target-language tag to the input string in the format <2xx>,
# where xx is the language code.
# Language codes: 'krl' for Proper Karelian, 'lud' for Ludian, 'olo' for Livvi,
# 'vep' for Veps, 'ru' for Russian, 'en' for English.
text = '<2krl>' + 'Здравствуйте!'

inputs = tokenizer(text, return_tensors='pt').input_ids
outputs = model.generate(inputs)
tokenizer.decode(outputs[0], skip_special_tokens=True)
# Output: Terveh!
```
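
The same pattern works in the opposite direction: prepend the tag of the desired target language. The sketch below is illustrative only; it reuses the model and tokenizer loaded above, and the `max_new_tokens` and `num_beams` generation settings are optional choices, not requirements of this model.

```python
# Translate back into Russian: prepend the <2ru> tag to Karelian input.
text = '<2ru>' + 'Terveh!'

inputs = tokenizer(text, return_tensors='pt').input_ids
# max_new_tokens and num_beams are optional generation settings shown for illustration.
outputs = model.generate(inputs, max_new_tokens=128, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# The output should be a Russian rendering of the greeting.
```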

Please cite the following paper if you use this model in your work:
```
@inproceedings{pashchenko2024paragraphlevel,
  title={Paragraph-Level Machine Translation for Low-Resource Finno-Ugric Languages},
  author={Dmytro Pashchenko and Lisa Yankovskaya and Mark Fishel},
  booktitle={The Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies},
  year={2024},
  url={https://openreview.net/forum?id=uTFJsQpNZk}
}
```