---
library_name: transformers
tags:
- machine translation
- english-german
- english
- german
- bilingual
license: apache-2.0
datasets:
- rewicks/english-german-data
language:
- en
- de
pipeline_tag: translation
---

# Model Card for English-German Translation Model

This model is a simple bilingual English-German machine translation model trained with [MarianNMT](https://marian-nmt.github.io/).
It was converted to Hugging Face format using [scripts](https://huggingface.co/Helsinki-NLP/opus-mt-en-zh/discussions/1) derived from the Helsinki-NLP group's work.
We collected most of the datasets listed in [mtdata](https://github.com/thammegowda/mtdata) and filtered them.
The [processed data](https://huggingface.co/datasets/rewicks/english-german-data) is also available on Hugging Face.

We trained these models in order to develop a new ensembling algorithm.
**Agreement-Based Ensembling** is an inference-time-only algorithm that allows for ensembling models with different vocabularies, without the need to learn additional parameters or alter the underlying models.
Instead, the algorithm ensures that tokens generated by the ensembled models _agree_ in their surface form.
For more information, please check out [our code available on GitHub](https://github.com/mjpost/ensemble24), or read our paper on arXiv.
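
As a rough, hypothetical sketch of the agreement idea (the actual algorithm and implementation are in the GitHub repository above), the snippet below checks whether two partial hypotheses produced by models with different vocabularies remain compatible at the surface level. Everything in it, including the example segmentations, is invented purely for illustration.

```python
def surfaces_agree(a: str, b: str) -> bool:
    """Two partial hypotheses are compatible if one detokenized surface string
    is a prefix of the other, even when their token boundaries differ."""
    return a.startswith(b) or b.startswith(a)


# Hypothetical partial hypotheses from two models with different vocabularies.
# Model A segments "Hallo Welt" as ["Hal", "lo", " Welt"]; model B as ["Hallo", " We", "lt"].
hyp_a = "Hal" + "lo"        # surface so far from model A: "Hallo"
hyp_b = "Hallo" + " We"     # surface so far from model B: "Hallo We"

print(surfaces_agree(hyp_a, hyp_b))      # True: token boundaries differ, surfaces are compatible
print(surfaces_agree("Hallo", "Guten"))  # False: diverging continuations would be pruned
```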


## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.

- **Developed and shared by:** Rachel Wicks
- **Funded by:** Johns Hopkins University
- **Model type:** Encoder-decoder (Transformer encoder, Transformer decoder)
- **Language(s) (NLP):** English, German
- **License:** Apache 2.0

### Model Sources [optional]

<!-- Provide the basic links for the model. -->

- **Paper [optional]:** Coming Soon!



## How to Get Started with the Model

The code below can be used to translate lines read from standard input (the baseline setup in our paper).

```python
import sys
import torch

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# The model ID (or a local path) is passed as the first command-line argument.
model_id = sys.argv[1]

device = "cuda" if torch.cuda.is_available() else "cpu"

# Note: the dtype argument belongs to the model, not the tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)
model = model.eval()

# Translate one input line at a time.
for line in sys.stdin:
    line = line.strip()
    inputs = tokenizer(line, return_tensors="pt").to(device)
    with torch.no_grad():
        translated_tokens = model.generate(
            **inputs,
            max_length=256,
            num_beams=5,
        )
    print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0])
```
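
For example, assuming the script above is saved as `translate.py` (a filename chosen here for illustration), a single sentence can be translated with `echo "This is a test." | python translate.py <model-id>`, where `<model-id>` is this repository's ID on the Hub or a local path.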

## Training Details

Data is available [here](https://huggingface.co/datasets/rewicks/english-german-data).
We use [sotastream](https://pypi.org/project/sotastream/) to stream data over stdin.
We use [MarianNMT](https://marian-nmt.github.io/) to train. 
The config is available in the repo as `config.yml`.


## Evaluation

BLEU on WMT24 is XX.

#### Hardware

RTX Titan (24GB)

## Citation [optional]

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**