---
language:
- cs
- pl
- sk
- sl
- en
library_name: transformers
license: cc-by-4.0
tags:
- translation
- mt
- marian
- pytorch
- sentence-piece
- multilingual
- allegro
- laniqo
pipeline_tag: translation
---
# MultiSlav BiDi Models
## Multilingual BiDi MT Models
___BiDi___ is a collection of Encoder-Decoder vanilla transformer models trained on the sentence-level Machine Translation task.
Each model supports bi-directional translation. More information is available in our [MultiSlav paper](https://hf.co/papers/2502.14509).
___BiDi___ models are part of the [___MultiSlav___ collection](https://huggingface.co/collections/allegro/multislav-6793d6b6419e5963e759a683).
Experiments were conducted as part of a research project by the [Machine Learning Research](https://ml.allegro.tech/) lab for [Allegro.com](https://ml.allegro.tech/).
Big thanks to [laniqo.com](https://laniqo.com) for cooperation in the research.
The graphic above shows an example ___BiDi___ model, [BiDi-ces-pol](https://huggingface.co/allegro/bidi-ces-pol), translating from Polish to Czech.
___BiDi-ces-pol___ is a bi-directional model supporting translation in both the __Czech-to-Polish__ and __Polish-to-Czech__ directions.
### Supported languages
To use a ___BiDi___ model, you must provide the target language for translation.
Target language tokens are 3-letter ISO 639-3 language codes in the format `>>xxx<<`.
All accepted directions and their respective tokens are listed below.
Note that for each model only two directions are available.
Each direction token was added as a special token to the SentencePiece tokenizer.
| **Target Language** | **First token** |
|---------------------|-----------------|
| Czech | `>>ces<<` |
| English | `>>eng<<` |
| Polish | `>>pol<<` |
| Slovak | `>>slk<<` |
| Slovene | `>>slv<<` |
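To select the translation direction, the target-language token is simply prepended to the source text. A minimal illustration of the input format (the helper `with_target_token` is our own naming, not part of the model's API):

```python
def with_target_token(text: str, target_lang: str) -> str:
    """Prepend the >>xxx<< target-language token expected by BiDi models."""
    return f">>{target_lang}<< {text}"

# Ask a Czech <-> Polish model to translate into Polish:
print(with_target_token("Dobrý den!", "pol"))  # >>pol<< Dobrý den!
```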
### Bi-Di models available
We provide 10 ___BiDi___ models, covering 20 translation directions between 5 languages.
| **Bi-Di model** | **Languages supported** | **HF repository** |
|-----------------|-------------------------|---------------------------------------------------------------------|
| BiDi-ces-eng | Czech ↔ English | [allegro/BiDi-ces-eng](https://huggingface.co/allegro/bidi-ces-eng) |
| BiDi-ces-pol | Czech ↔ Polish | [allegro/BiDi-ces-pol](https://huggingface.co/allegro/bidi-ces-pol) |
| BiDi-ces-slk | Czech ↔ Slovak | [allegro/BiDi-ces-slk](https://huggingface.co/allegro/bidi-ces-slk) |
| BiDi-ces-slv | Czech ↔ Slovene | [allegro/BiDi-ces-slv](https://huggingface.co/allegro/bidi-ces-slv) |
| BiDi-eng-pol | English ↔ Polish | [allegro/BiDi-eng-pol](https://huggingface.co/allegro/bidi-eng-pol) |
| BiDi-eng-slk | English ↔ Slovak | [allegro/BiDi-eng-slk](https://huggingface.co/allegro/bidi-eng-slk) |
| BiDi-eng-slv | English ↔ Slovene | [allegro/BiDi-eng-slv](https://huggingface.co/allegro/bidi-eng-slv) |
| BiDi-pol-slk | Polish ↔ Slovak | [allegro/BiDi-pol-slk](https://huggingface.co/allegro/bidi-pol-slk) |
| BiDi-pol-slv | Polish ↔ Slovene | [allegro/BiDi-pol-slv](https://huggingface.co/allegro/bidi-pol-slv) |
| BiDi-slk-slv | Slovak ↔ Slovene | [allegro/BiDi-slk-slv](https://huggingface.co/allegro/bidi-slk-slv) |
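Since each repository name lists its two ISO 639-3 codes in alphabetical order, the repo id for a given language pair can be derived mechanically. A small sketch (the helper `bidi_repo` is illustrative, not part of any library):

```python
def bidi_repo(lang_a: str, lang_b: str) -> str:
    """Return the Hugging Face repo id of the BiDi model for a language pair.

    Repo names always order the two ISO 639-3 codes alphabetically,
    so both argument orders resolve to the same model.
    """
    first, second = sorted((lang_a, lang_b))
    return f"allegro/bidi-{first}-{second}"

print(bidi_repo("pol", "ces"))  # allegro/bidi-ces-pol
print(bidi_repo("ces", "pol"))  # allegro/bidi-ces-pol
```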
## Use case quickstart
Example code snippet for using a model. Due to a bug, the `MarianMTModel` class must be used explicitly.
Remember to adjust the source and target languages to your use case.
```python
from transformers import AutoTokenizer, MarianMTModel

source_lang = "pol"
target_lang = "ces"
# Repo names list the two language codes in alphabetical order.
first_lang, second_lang = sorted([source_lang, target_lang])

model_name = f"Allegro/BiDi-{first_lang}-{second_lang}"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Prepend the target-language token to select the translation direction.
text = f">>{target_lang}<< " + "Allegro to internetowa platforma e-commerce, na której swoje produkty sprzedają średnie i małe firmy, jak również duże marki."
batch_to_translate = [text]

batch = tokenizer(batch_to_translate, return_tensors="pt", padding=True)
translations = model.generate(**batch)
decoded_translation = tokenizer.batch_decode(translations, skip_special_tokens=True, clean_up_tokenization_spaces=True)[0]
print(decoded_translation)
```
Generated Czech output:
> Allegro je online e-commerce platforma, na které své výrobky prodávají střední a malé firmy, stejně jako velké značky.
## Training
The [SentencePiece](https://github.com/google/sentencepiece) tokenizer has a vocabulary of 32k tokens in total (16k per language). The tokenizer was trained on a randomly sampled part of the training corpus.
During training we used the [MarianNMT](https://marian-nmt.github.io/) framework.
Base Marian configuration used: [transformer-big](https://github.com/marian-nmt/marian-dev/blob/master/src/common/aliases.cpp#L113).
All training parameters are listed in the table below.
### Training hyperparameters:
## Training corpora
## Evaluation
## Limitations and Biases
## License
## Citation
TO BE UPDATED SOON 🤗
## Contact Options