---
library_name: transformers
tags:
- machine translation
- english-german
- english
- german
- bilingual
license: apache-2.0
datasets:
- rewicks/english-german-data
language:
- en
- de
pipeline_tag: translation
---
# Model Card for Model ID
This model is a simple bilingual English-German machine translation model trained with [MarianNMT](https://marian-nmt.github.io/).
It was converted to Hugging Face format using [scripts](https://huggingface.co/Helsinki-NLP/opus-mt-en-zh/discussions/1) derived from the Helsinki-NLP group.
We collected most of the datasets listed in [mtdata](https://github.com/thammegowda/mtdata) and filtered them.
The [processed data](https://huggingface.co/datasets/rewicks/english-german-data) is also available on Hugging Face.
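To inspect the processed data, it can be streamed with the 🤗 `datasets` library. The sketch below is minimal; the `train` split name and the field layout are assumptions, so check the dataset card for the exact schema.

```python
# Minimal sketch for browsing the processed data; the split name and field layout
# are assumptions, see the dataset card for the exact schema.
from datasets import load_dataset

ds = load_dataset("rewicks/english-german-data", split="train", streaming=True)
print(next(iter(ds)))  # print one example to see the actual fields
```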
We trained these models in order to develop a new ensembling algorithm.
**Agreement-Based Ensembling** is an inference-time-only algorithm that allows for ensembling models with different vocabularies, without the need to learn additional parameters or alter the underlying models.
Instead, the algorithm ensures that tokens generated by the ensembled models _agree_ in their surface form.
For more information, please check out [our code available on GitHub](https://github.com/mjpost/ensemble24), or read our paper on arXiv.
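As a rough, hypothetical illustration of the agreement criterion (the actual algorithm, beam handling, and scoring live in the repository above), two partial hypotheses can be treated as agreeing when one detokenized surface string is a prefix of the other:

```python
# Hypothetical sketch of the surface-form agreement check, not the actual
# implementation from the ensemble24 repository.

def surfaces_agree(a: str, b: str) -> bool:
    """Two partial hypotheses agree if one detokenized string is a prefix of the other."""
    return a.startswith(b) or b.startswith(a)

# Each model extends its hypothesis with tokens from its own vocabulary; only
# extensions whose detokenized surface strings agree are kept, so the ensemble
# advances along a shared surface string.
candidates_a = ["The hou", "The house"]    # detokenized prefixes proposed by model A
candidates_b = ["The house", "The horse"]  # detokenized prefixes proposed by model B

kept = [(a, b) for a in candidates_a for b in candidates_b if surfaces_agree(a, b)]
print(kept)  # [('The hou', 'The house'), ('The house', 'The house')]
```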
## Model Details

### Model Description

This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.
- **Developed and shared by:** Rachel Wicks
- **Funded by:** Johns Hopkins University
- **Model type:** Transformer encoder-decoder
- **Language(s) (NLP):** English, German
- **License:** Apache 2.0
### Model Sources

- **Paper:** Coming Soon!
## How to Get Started with the Model

The code below can be used to translate lines read from standard input (this is the baseline setup from our paper).
```python
import sys

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# The model repository id is passed as the first command-line argument.
model_id = sys.argv[1]

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)
model = model.eval()

# Translate one line at a time from standard input using beam search.
for line in sys.stdin:
    line = line.strip()
    inputs = tokenizer(line, return_tensors="pt").to(device)
    translated_tokens = model.generate(
        **inputs,
        max_length=256,
        num_beams=5,
    )
    print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0])
```
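If you save the snippet as, say, `translate.py` (a placeholder name), you can translate a file with `cat test.en | python translate.py <model-id>`, which prints one translation per input line.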
## Training Details

Data is available [here](https://huggingface.co/datasets/rewicks/english-german-data).
We use [sotastream](https://pypi.org/project/sotastream/) to stream data over stdin.
We use [MarianNMT](https://marian-nmt.github.io/) to train.
The config is available in the repo as `config.yml`.
## Evaluation

BLEU on WMT24 is XX.
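For reference, a corpus-level BLEU score can be computed with [sacrebleu](https://github.com/mjpost/sacrebleu); this is a minimal sketch, and the file names are placeholders for detokenized system outputs and references, one sentence per line.

```python
# Sketch: corpus BLEU with sacrebleu; the file names are placeholders.
import sacrebleu

with open("wmt24.hyp.de", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("wmt24.ref.de", encoding="utf-8") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}")
```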
#### Hardware

NVIDIA Titan RTX (24 GB)
## Citation

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]