---
library_name: transformers
tags:
- machine translation
- english-german
- english
- german
- bilingual
license: apache-2.0
datasets:
- rewicks/english-german-data
language:
- en
- de
pipeline_tag: translation
---
# Model Card for Model ID
This model is a simple bilingual English-German machine translation model trained with [MarianNMT](https://marian-nmt.github.io/).
It was converted to Hugging Face format using [scripts](https://huggingface.co/Helsinki-NLP/opus-mt-en-zh/discussions/1) derived from the Helsinki-NLP group.
We collected most of the datasets listed in [mtdata](https://github.com/thammegowda/mtdata) and filtered them.
The [processed data](https://huggingface.co/datasets/rewicks/english-german-data) is also available on Hugging Face.
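To inspect the processed data, it can be streamed with the 🤗 `datasets` library. The sketch below is minimal; the `train` split name and the field layout are assumptions, so check the dataset card for the exact schema.

```python
# Minimal sketch for browsing the processed data; the split name and field layout
# are assumptions, see the dataset card for the exact schema.
from datasets import load_dataset

ds = load_dataset("rewicks/english-german-data", split="train", streaming=True)
print(next(iter(ds)))  # print one example to see the actual fields
```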
We trained these models in order to develop a new ensembling algorithm.
**Agreement-Based Ensembling** is an inference-time-only algorithm that allows for ensembling models with different vocabularies, without the need to learn additional parameters or alter the underlying models.
Instead, the algorithm ensures that tokens generated by the ensembled models _agree_ in their surface form.
For more information, please check out [our code available on GitHub](https://github.com/mjpost/ensemble24), or read our paper on arXiv.
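As a rough, hypothetical illustration of the agreement criterion (the actual algorithm, beam handling, and scoring live in the repository above), two partial hypotheses can be treated as agreeing when one detokenized surface string is a prefix of the other:

```python
# Hypothetical sketch of the surface-form agreement check, not the actual
# implementation from the ensemble24 repository.

def surfaces_agree(a: str, b: str) -> bool:
    """Two partial hypotheses agree if one detokenized string is a prefix of the other."""
    return a.startswith(b) or b.startswith(a)

# Each model extends its hypothesis with tokens from its own vocabulary; only
# extensions whose detokenized surface strings agree are kept, so the ensemble
# advances along a shared surface string.
candidates_a = ["The hou", "The house"]    # detokenized prefixes proposed by model A
candidates_b = ["The house", "The horse"]  # detokenized prefixes proposed by model B

kept = [(a, b) for a in candidates_a for b in candidates_b if surfaces_agree(a, b)]
print(kept)  # [('The hou', 'The house'), ('The house', 'The house')]
```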
## Model Details

### Model Description

This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.
- **Developed and shared by:** Rachel Wicks
- **Funded by:** Johns Hopkins University
- **Model type:** Transformer encoder-decoder
- **Language(s) (NLP):** English, German
- **License:** Apache 2.0
### Model Sources

- **Paper:** Coming Soon!
## How to Get Started with the Model

The code below can be used to translate lines read from standard input (this is the baseline setup from our paper).
```python
import sys

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# The model repository id is passed as the first command-line argument.
model_id = sys.argv[1]

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)
model = model.eval()

# Translate one line at a time from standard input using beam search.
for line in sys.stdin:
    line = line.strip()
    inputs = tokenizer(line, return_tensors="pt").to(device)
    translated_tokens = model.generate(
        **inputs,
        max_length=256,
        num_beams=5,
    )
    print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0])
```
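If you save the snippet as, say, `translate.py` (a placeholder name), you can translate a file with `cat test.en | python translate.py <model-id>`, which prints one translation per input line.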
## Training Details

Data is available [here](https://huggingface.co/datasets/rewicks/english-german-data).
We use [sotastream](https://pypi.org/project/sotastream/) to stream data over stdin.
We use [MarianNMT](https://marian-nmt.github.io/) to train.
The config is available in the repo as `config.yml`.
## Evaluation

BLEU on WMT24 is XX.
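For reference, a corpus-level BLEU score can be computed with [sacrebleu](https://github.com/mjpost/sacrebleu); this is a minimal sketch, and the file names are placeholders for detokenized system outputs and references, one sentence per line.

```python
# Sketch: corpus BLEU with sacrebleu; the file names are placeholders.
import sacrebleu

with open("wmt24.hyp.de", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("wmt24.ref.de", encoding="utf-8") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}")
```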
#### Hardware

NVIDIA Titan RTX (24 GB)
## Citation

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]