opus-mt-tc-bible-big-deu_eng_fra_por_spa-gmq

Model Details

Neural machine translation model for translating from German, English, French, Portuguese and Spanish (deu+eng+fra+por+spa) to North Germanic languages (gmq).

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages in the world. All models are originally trained using the Marian NMT framework, an efficient NMT implementation written in pure C++. The models have been converted to PyTorch using the transformers library by Hugging Face. Training data is taken from OPUS, and training pipelines use the procedures of OPUS-MT-train.

Model Description:

This is a multilingual translation model with multiple target languages. A sentence-initial language token is required in the form >>id<< (where id is a valid target language ID), e.g. >>dan<<.
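
As a minimal sketch of the token convention described above, the helper below prepends the >>id<< token to a source sentence. The names `with_target_token` and `KNOWN_TARGET_IDS` are hypothetical and are not part of the model or the transformers library. The ID set is an assumption: it lists only the target codes that appear in this card's evaluation table and may not be exhaustive.

```python
# Hypothetical helper; not part of transformers or OPUS-MT.
# Assumption: target IDs collected from this card's evaluation table.
KNOWN_TARGET_IDS = {"dan", "fao", "isl", "nno", "nob", "nor", "swe"}


def with_target_token(text: str, target_id: str) -> str:
    """Return `text` prefixed with the sentence-initial >>id<< token."""
    if target_id not in KNOWN_TARGET_IDS:
        raise ValueError(f"unexpected target language ID: {target_id!r}")
    return f">>{target_id}<< {text.strip()}"


if __name__ == "__main__":
    print(with_target_token("This is a test.", "dan"))
```

Rejecting unlisted IDs is a design choice for this sketch only; it guards against typos such as passing a source language code where a target code is expected.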

Uses

This model can be used for translation and text-to-text generation.

Risks, Limitations and Biases

CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing, offensive, and can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).

How to Get Started With the Model

A short code example:

from transformers import MarianMTModel, MarianTokenizer

# Each input starts with the target-language token (>>id<<).
src_text = [
    ">>dan<< Replace this with text in an accepted source language.",
    ">>swe<< This is the second sentence."
]

model_name = "Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-gmq"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch with padding, then generate the translations.
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

You can also use OPUS-MT models with the transformers pipelines, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-gmq")
print(pipe(">>dan<< Replace this with text in an accepted source language."))

Training

Evaluation

langpair testset chr-F BLEU #sent #words
deu-dan tatoeba-test-v2021-08-07 0.74051 57.8 9998 74644
deu-isl tatoeba-test-v2021-08-07 0.61256 31.7 969 5951
deu-nob tatoeba-test-v2021-08-07 0.71413 52.9 3525 31978
deu-nor tatoeba-test-v2021-08-07 0.71253 52.7 3651 32928
deu-swe tatoeba-test-v2021-08-07 0.72650 58.2 3410 22701
eng-dan tatoeba-test-v2021-08-07 0.74708 60.6 10795 79385
eng-fao tatoeba-test-v2021-08-07 0.48304 29.0 294 1933
eng-isl tatoeba-test-v2021-08-07 0.58312 33.2 2503 19023
eng-nno tatoeba-test-v2021-08-07 0.62606 42.7 460 3428
eng-nob tatoeba-test-v2021-08-07 0.72340 57.4 4539 36119
eng-nor tatoeba-test-v2021-08-07 0.71514 56.2 5000 39552
eng-swe tatoeba-test-v2021-08-07 0.73720 60.5 10362 68067
fra-dan tatoeba-test-v2021-08-07 0.78018 64.1 1731 11312
fra-nob tatoeba-test-v2021-08-07 0.74252 59.1 323 2175
fra-nor tatoeba-test-v2021-08-07 0.74407 60.3 477 3097
fra-swe tatoeba-test-v2021-08-07 0.75644 62.1 1407 9170
por-dan tatoeba-test-v2021-08-07 0.79528 65.6 873 5258
por-nor tatoeba-test-v2021-08-07 0.73559 58.0 481 4030
por-swe tatoeba-test-v2021-08-07 0.75566 60.2 320 1938
spa-dan tatoeba-test-v2021-08-07 0.73310 57.7 5000 35937
spa-isl tatoeba-test-v2021-08-07 0.52169 18.7 238 1220
spa-nob tatoeba-test-v2021-08-07 0.76501 60.9 885 6762
spa-nor tatoeba-test-v2021-08-07 0.75815 60.1 960 7217
spa-swe tatoeba-test-v2021-08-07 0.74222 60.7 1351 8357
deu-dan flores101-devtest 0.62006 34.8 1012 24638
deu-isl flores101-devtest 0.48236 18.8 1012 22834
deu-swe flores101-devtest 0.61778 33.7 1012 23121
eng-swe flores101-devtest 0.69435 45.5 1012 23121
fra-dan flores101-devtest 0.61019 34.0 1012 24638
fra-isl flores101-devtest 0.47647 18.1 1012 22834
fra-swe flores101-devtest 0.60354 32.2 1012 23121
por-isl flores101-devtest 0.47937 19.1 1012 22834
por-swe flores101-devtest 0.60857 33.1 1012 23121
spa-dan flores101-devtest 0.54890 24.4 1012 24638
spa-nob flores101-devtest 0.50610 18.3 1012 23873
spa-swe flores101-devtest 0.54011 22.4 1012 23121
deu-dan flores200-devtest 0.62152 35.1 1012 24638
deu-isl flores200-devtest 0.48648 19.1 1012 22834
deu-nno flores200-devtest 0.53530 24.0 1012 24316
deu-nob flores200-devtest 0.55748 25.1 1012 23873
deu-swe flores200-devtest 0.62138 34.2 1012 23121
eng-dan flores200-devtest 0.70321 47.0 1012 24638
eng-isl flores200-devtest 0.52585 24.4 1012 22834
eng-nno flores200-devtest 0.61372 33.8 1012 24316
eng-nob flores200-devtest 0.62508 34.4 1012 23873
eng-swe flores200-devtest 0.69703 46.0 1012 23121
fra-dan flores200-devtest 0.61025 34.1 1012 24638
fra-isl flores200-devtest 0.48273 18.8 1012 22834
fra-nno flores200-devtest 0.53032 24.3 1012 24316
fra-nob flores200-devtest 0.54933 25.0 1012 23873
fra-swe flores200-devtest 0.60612 32.8 1012 23121
por-dan flores200-devtest 0.62221 36.2 1012 24638
por-isl flores200-devtest 0.48357 19.6 1012 22834
por-nno flores200-devtest 0.54369 26.3 1012 24316
por-nob flores200-devtest 0.56054 26.4 1012 23873
por-swe flores200-devtest 0.61388 34.1 1012 23121
spa-dan flores200-devtest 0.55091 24.7 1012 24638
spa-isl flores200-devtest 0.44469 14.2 1012 22834
spa-nno flores200-devtest 0.48898 18.6 1012 24316
spa-nob flores200-devtest 0.50901 18.8 1012 23873
spa-swe flores200-devtest 0.54182 22.7 1012 23121
eng-isl newstest2021 0.51196 21.9 1000 25233
deu-dan ntrex128 0.56412 29.1 1997 47643
deu-isl ntrex128 0.48309 18.8 1997 46643
deu-nno ntrex128 0.51535 22.0 1997 46512
deu-nob ntrex128 0.56152 27.6 1997 45501
deu-swe ntrex128 0.58061 29.6 1997 44889
eng-dan ntrex128 0.61894 37.6 1997 47643
eng-isl ntrex128 0.52027 23.9 1997 46643
eng-nno ntrex128 0.60754 34.0 1997 46512
eng-nob ntrex128 0.62327 36.9 1997 45501
eng-swe ntrex128 0.66129 41.3 1997 44889
fra-dan ntrex128 0.54102 27.1 1997 47643
fra-isl ntrex128 0.47296 18.4 1997 46643
fra-nno ntrex128 0.50532 21.6 1997 46512
fra-nob ntrex128 0.54026 25.7 1997 45501
fra-swe ntrex128 0.56278 27.9 1997 44889
por-dan ntrex128 0.56288 30.0 1997 47643
por-isl ntrex128 0.47577 17.8 1997 46643
por-nno ntrex128 0.52158 23.0 1997 46512
por-nob ntrex128 0.55788 27.4 1997 45501
por-swe ntrex128 0.57790 29.3 1997 44889
spa-dan ntrex128 0.55607 27.5 1997 47643
spa-isl ntrex128 0.48566 18.4 1997 46643
spa-nno ntrex128 0.51741 22.2 1997 46512
spa-nob ntrex128 0.55824 26.8 1997 45501
spa-swe ntrex128 0.57851 28.8 1997 44889
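
The rows above are whitespace-separated, so they can be summarized with a short script. The following is a sketch only: `parse_row` and `mean_bleu_by_testset` are hypothetical names, the three sample rows are copied from the table, and the unweighted mean over language pairs is an illustrative choice rather than a metric reported by the authors.

```python
from collections import defaultdict
from statistics import mean

# A few rows copied from the table above, in the same column order:
# langpair testset chr-F BLEU #sent #words
ROWS = """\
deu-dan tatoeba-test-v2021-08-07 0.74051 57.8 9998 74644
eng-dan tatoeba-test-v2021-08-07 0.74708 60.6 10795 79385
eng-isl newstest2021 0.51196 21.9 1000 25233
"""


def parse_row(line: str) -> dict:
    """Split one whitespace-separated table row into typed fields."""
    langpair, testset, chrf, bleu, n_sent, n_words = line.split()
    return {
        "langpair": langpair,
        "testset": testset,
        "chrf": float(chrf),
        "bleu": float(bleu),
        "sentences": int(n_sent),
        "words": int(n_words),
    }


def mean_bleu_by_testset(text: str) -> dict:
    """Average the BLEU column over all rows that share a test set."""
    scores = defaultdict(list)
    for line in text.splitlines():
        if line.strip():
            row = parse_row(line)
            scores[row["testset"]].append(row["bleu"])
    return {testset: mean(values) for testset, values in scores.items()}


if __name__ == "__main__":
    for testset, bleu in mean_bleu_by_testset(ROWS).items():
        print(f"{testset}: {bleu:.1f}")
```

Because the test sets differ in size and domain, such averages are only a rough way to compare groups of rows.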

Citation Information

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  number={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Acknowledgements

The work is supported by the HPLT project, funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland, and the EuroHPC supercomputer LUMI.

Model conversion info

  • transformers version: 4.45.1
  • OPUS-MT git hash: 0882077
  • port time: Tue Oct 8 10:03:29 EEST 2024
  • port machine: LM0-400-22516.local