---
language:
- cs
- pl
- sk
- sl
- en
library_name: transformers
license: cc-by-4.0
tags:
- translation
- mt
- marian
- pytorch
- sentence-piece
- multilingual
- allegro
- laniqo
pipeline_tag: translation
---

# MultiSlav BiDi Models

*MLR @ Allegro.com*

## Multilingual BiDi MT Models

___BiDi___ is a collection of Encoder-Decoder vanilla transformer models trained on the sentence-level Machine Translation task. Each model supports Bi-Directional translation. More information is available in our [MultiSlav paper](https://hf.co/papers/2502.14509).

___BiDi___ models are part of the [___MultiSlav___ collection](https://huggingface.co/collections/allegro/multislav-6793d6b6419e5963e759a683). Experiments were conducted under a research project by the [Machine Learning Research](https://ml.allegro.tech/) lab for [Allegro.com](https://ml.allegro.tech/). Big thanks to [laniqo.com](https://laniqo.com) for cooperation in the research.

The graphic above shows an example of a BiDi model, [BiDi-ces-pol](https://huggingface.co/allegro/bidi-ces-pol), translating from Polish to Czech. ___BiDi-ces-pol___ is a bi-directional model supporting translation both __from Czech to Polish__ and __from Polish to Czech__.

### Supported languages

To use a ___BiDi___ model, you must provide the target language for translation. Target language tokens are represented as 3-letter ISO 639-3 language codes embedded in the format `>>xxx<<`. All accepted directions and their respective tokens are listed below. Note that each model supports only two directions; each direction token was added as a special token to the SentencePiece tokenizer.

| **Target Language** | **First token** |
|---------------------|-----------------|
| Czech               | `>>ces<<`       |
| English             | `>>eng<<`       |
| Polish              | `>>pol<<`       |
| Slovak              | `>>slk<<`       |
| Slovene             | `>>slv<<`       |

### Bi-Di models available

We provide 10 ___BiDi___ models, covering 20 translation directions between 5 languages.

| **Bi-Di model** | **Languages supported** | **HF repository**                                                   |
|-----------------|-------------------------|---------------------------------------------------------------------|
| BiDi-ces-eng    | Czech ↔ English         | [allegro/BiDi-ces-eng](https://huggingface.co/allegro/bidi-ces-eng) |
| BiDi-ces-pol    | Czech ↔ Polish          | [allegro/BiDi-ces-pol](https://huggingface.co/allegro/bidi-ces-pol) |
| BiDi-ces-slk    | Czech ↔ Slovak          | [allegro/BiDi-ces-slk](https://huggingface.co/allegro/bidi-ces-slk) |
| BiDi-ces-slv    | Czech ↔ Slovene         | [allegro/BiDi-ces-slv](https://huggingface.co/allegro/bidi-ces-slv) |
| BiDi-eng-pol    | English ↔ Polish        | [allegro/BiDi-eng-pol](https://huggingface.co/allegro/bidi-eng-pol) |
| BiDi-eng-slk    | English ↔ Slovak        | [allegro/BiDi-eng-slk](https://huggingface.co/allegro/bidi-eng-slk) |
| BiDi-eng-slv    | English ↔ Slovene       | [allegro/BiDi-eng-slv](https://huggingface.co/allegro/bidi-eng-slv) |
| BiDi-pol-slk    | Polish ↔ Slovak         | [allegro/BiDi-pol-slk](https://huggingface.co/allegro/bidi-pol-slk) |
| BiDi-pol-slv    | Polish ↔ Slovene        | [allegro/BiDi-pol-slv](https://huggingface.co/allegro/bidi-pol-slv) |
| BiDi-slk-slv    | Slovak ↔ Slovene        | [allegro/BiDi-slk-slv](https://huggingface.co/allegro/bidi-slk-slv) |

## Use case quickstart

Example code snippet for using the model. Due to a bug, the `MarianMTModel` class must be used explicitly. Remember to adjust the source and target languages to your use case.

```python
from transformers import AutoTokenizer, MarianMTModel

source_lang = "pol"
target_lang = "ces"
# Repository names order the two language codes alphabetically.
first_lang, second_lang = sorted([source_lang, target_lang])
model_name = f"allegro/BiDi-{first_lang}-{second_lang}"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Prepend the target-language token to the source sentence.
text = f">>{target_lang}<< Allegro to internetowa platforma e-commerce, na której swoje produkty sprzedają średnie i małe firmy, jak również duże marki."

batch_to_translate = [text]
translations = model.generate(**tokenizer.batch_encode_plus(batch_to_translate, return_tensors="pt"))
decoded_translation = tokenizer.batch_decode(translations, skip_special_tokens=True, clean_up_tokenization_spaces=True)[0]

print(decoded_translation)
```

Generated Czech output:

> Allegro je online e-commerce platforma, na které své výrobky prodávají střední a malé firmy, stejně jako velké značky.
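Because each checkpoint is bi-directional, the same model also handles the opposite direction. Below is a minimal sketch, assuming the `allegro/BiDi-ces-pol` checkpoint and reusing the Czech output from the example above as input, that checks both direction tokens are in the tokenizer's vocabulary and translates from Czech back to Polish:

```python
from transformers import AutoTokenizer, MarianMTModel

model_name = "allegro/BiDi-ces-pol"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Both direction tokens should resolve to real vocabulary ids
# (an unknown token would map to the <unk> id instead).
for token in (">>ces<<", ">>pol<<"):
    print(token, tokenizer.convert_tokens_to_ids(token))

# Same checkpoint, opposite direction: prepend the Polish target token
# to a Czech source sentence.
text = ">>pol<< Allegro je online e-commerce platforma, na které své výrobky prodávají střední a malé firmy, stejně jako velké značky."
batch = tokenizer(text, return_tensors="pt")
translated = model.generate(**batch)
print(tokenizer.batch_decode(translated, skip_special_tokens=True)[0])
```

Prepending `>>pol<<` instead of `>>ces<<` is the only change needed to flip the translation direction.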
## Training

The [SentencePiece](https://github.com/google/sentencepiece) tokenizer has a vocabulary size of 32k in total (16k per language). The tokenizer was trained on a randomly sampled part of the training corpus. For training we used the [MarianNMT](https://marian-nmt.github.io/) framework. The base Marian configuration used: [transformer-big](https://github.com/marian-nmt/marian-dev/blob/master/src/common/aliases.cpp#L113). All training parameters are listed in the table below.

### Training hyperparameters:

## Training corpora

## Evaluation

## Limitations and Biases

## License

The model is licensed under the CC BY 4.0 license.

## Citation

TO BE UPDATED SOON 🤗

## Contact Options