---
library_name: transformers
tags:
- machine translation
- english-german
- english
- german
- bilingual
license: apache-2.0
datasets:
- rewicks/english-german-data
language:
- en
- de
pipeline_tag: translation
---

# Model Card

This model is a simple bilingual English-German machine translation model trained with [MarianNMT](https://marian-nmt.github.io/). It was converted to the Hugging Face format using [scripts](https://huggingface.co/Helsinki-NLP/opus-mt-en-zh/discussions/1) derived from the Helsinki-NLP group. We collected most of the datasets listed by [mtdata](https://github.com/thammegowda/mtdata) and filtered them. The [processed data](https://huggingface.co/datasets/rewicks/english-german-data) is also available on the Hugging Face Hub.

We trained these models in order to develop a new ensembling algorithm. **Agreement-Based Ensembling** is an inference-time-only algorithm that allows for ensembling models with different vocabularies, without the need to learn additional parameters or alter the underlying models. Instead, the algorithm ensures that tokens generated by the ensembled models _agree_ in their surface form (a toy illustration of this constraint appears at the end of this card). For more information, please check out [our code available on GitHub](https://github.com/mjpost/ensemble24), or read our paper on arXiv.

## Model Details

### Model Description

This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.

- **Developed and shared by:** Rachel Wicks
- **Funded by:** Johns Hopkins University
- **Model type:** Transformer encoder-decoder
- **Language(s) (NLP):** English, German
- **License:** Apache 2.0

### Model Sources

- **Paper:** Coming soon!

## How to Get Started with the Model

The code below translates lines read from standard input (the baseline system in our paper). Pass the model ID as its first command-line argument.

```python
import sys

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = sys.argv[1]
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the model in bfloat16 to save memory.
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)
model = model.eval()

# Translate one line at a time with beam search.
for line in sys.stdin:
    line = line.strip()
    inputs = tokenizer(line, return_tensors="pt").to(device)
    translated_tokens = model.generate(
        **inputs,
        max_length=256,
        num_beams=5,
    )
    print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0])
```

## Training Details

Data is available [here](https://huggingface.co/datasets/rewicks/english-german-data); a loading sketch appears at the end of this card. We use [sotastream](https://pypi.org/project/sotastream/) to stream training data over stdin, and [MarianNMT](https://marian-nmt.github.io/) to train. The config is available in the repo as `config.yml`.

## Evaluation

BLEU on WMT24 is XX (a scoring sketch appears at the end of this card).

#### Hardware

RTX Titan (24 GB)

## Citation

**BibTeX:**

[More Information Needed]

**APA:**
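
## Example: The Agreement Constraint (Toy)

The snippet below is a toy illustration of the surface-form agreement idea described above, not the paper's actual algorithm (see the [GitHub repository](https://github.com/mjpost/ensemble24) for that). It assumes two partial hypotheses are compatible when one detokenized string is a prefix of the other; the helper name `compatible` is ours.

```python
# Toy illustration only: two models with different vocabularies may tokenize the same
# surface string differently, e.g. ["Guten", " Tag"] vs ["Gut", "en", " Tag"]. Agreement
# is checked on the detokenized text, so partial hypotheses are treated as compatible
# when one surface string is a prefix of the other.
def compatible(surface_a: str, surface_b: str) -> bool:
    return surface_a.startswith(surface_b) or surface_b.startswith(surface_a)

assert compatible("Guten", "Gut")        # one string extends the other
assert compatible("Guten Tag", "Guten")  # still agree after generating more tokens
assert not compatible("Guten", "Hallo")  # surface forms diverge: prune
```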
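
## Example: Loading the Training Data

A minimal sketch for streaming the processed parallel data with 🤗 `datasets`. The split name and row schema are assumptions here; consult the [dataset card](https://huggingface.co/datasets/rewicks/english-german-data) for the actual layout.

```python
from datasets import load_dataset

# Stream the corpus rather than downloading it in full.
# Assumption: a "train" split exists; check the dataset card for the real schema.
ds = load_dataset("rewicks/english-german-data", split="train", streaming=True)
for i, example in enumerate(ds):
    print(example)  # inspect the first few rows to see the column names
    if i == 2:
        break
```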
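
## Example: Scoring with sacrebleu

A minimal sketch for computing a BLEU score like the one reported above, using [sacrebleu](https://github.com/mjpost/sacrebleu). The file names are placeholders for your own system output and the WMT24 reference; we have not verified the exact evaluation settings used in the paper.

```python
import sacrebleu

# Placeholders: one segment per line, hypotheses aligned with references.
with open("wmt24.hyp.de", encoding="utf-8") as f:
    hypotheses = [line.rstrip("\n") for line in f]
with open("wmt24.ref.de", encoding="utf-8") as f:
    references = [line.rstrip("\n") for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}")
```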