---
language:
- en
- pt
library_name: tf-keras
license: apache-2.0
tags:
- translation
---
# GRU-eng-por

## Model Overview

The GRU-eng-por model is a recurrent neural network (RNN) for English-to-Portuguese machine translation.

### Details

- **Size:** 42,554,912 parameters
- **Model type:** Recurrent neural network
- **Optimizer:** `rmsprop`
- **Number of Epochs:** 15
- **Dimensionality of the embedding layer:** 256
- **Dimensionality of the feed-forward network:** 1024
- **Hardware:** Tesla T4
- **Emissions:** Not measured
- **Total Energy Consumption:** Not measured
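
For reference, the sketch below shows an encoder-decoder layout consistent with the numbers above: a bidirectional `GRU` encoder, a `GRU` decoder, 20,000-token vocabularies, and the embedding and hidden dimensionalities listed. This is an illustrative reconstruction, not the released training script, although with TF2 defaults this layout does reproduce the 42,554,912 parameters listed above:

```python
import tensorflow as tf
from tensorflow.keras import layers

embed_dim, latent_dim, vocab_size = 256, 1024, 20000

# Encoder: embeds the English tokens and summarizes them into a single state
source = tf.keras.Input(shape=(None,), dtype="int64", name="english")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(source)
encoded = layers.Bidirectional(layers.GRU(latent_dim), merge_mode="sum")(x)

# Decoder: predicts the next Portuguese token from the tokens seen so far
past_target = tf.keras.Input(shape=(None,), dtype="int64", name="portuguese")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(past_target)
x = layers.GRU(latent_dim, return_sequences=True)(x, initial_state=encoded)
x = layers.Dropout(0.5)(x)
target_next = layers.Dense(vocab_size, activation="softmax")(x)

seq2seq_rnn = tf.keras.Model([source, past_target], target_next)
seq2seq_rnn.compile(optimizer="rmsprop",
                    loss="sparse_categorical_crossentropy",
                    metrics=["accuracy"])
```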

### How to Use

```python
!pip install "huggingface_hub[tensorflow]" -q

from huggingface_hub import from_pretrained_keras
from huggingface_hub import hf_hub_download
import tensorflow as tf
import numpy as np
import string
import re

# Select characters to strip, but preserve the "[" and "]"
strip_chars = string.punctuation
strip_chars = strip_chars.replace("[", "")
strip_chars = strip_chars.replace("]", "")

def custom_standardization(input_string):
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(lowercase, f"[{re.escape(strip_chars)}]", "")

# Load the `seq2seq_rnn` from the Hub
seq2seq_rnn = from_pretrained_keras("AiresPucrs/GRU-eng-por")

# Load the Portuguese vocabulary
portuguese_vocabulary_path = hf_hub_download(
    repo_id="AiresPucrs/GRU-eng-por",
    filename="portuguese_vocabulary.txt",
    repo_type='model',
    local_dir="./")

# Load the English vocabulary
english_vocabulary_path = hf_hub_download(
    repo_id="AiresPucrs/GRU-eng-por",
    filename="english_vocabulary.txt",
    repo_type='model',
    local_dir="./")

with open(portuguese_vocabulary_path, encoding='utf-8', errors='backslashreplace') as fp:
    portuguese_vocab = [line.strip() for line in fp]

with open(english_vocabulary_path, encoding='utf-8', errors='backslashreplace') as fp:
    english_vocab = [line.strip() for line in fp]

# Initialize the vectorizers with the learned vocabularies.
# The target sequence length is 21 (one token longer than the source)
# because the Portuguese targets are offset by one step during training.
target_vectorization = tf.keras.layers.TextVectorization(
    max_tokens=20000,
    output_mode="int",
    output_sequence_length=21,
    standardize=custom_standardization,
    vocabulary=portuguese_vocab)

source_vectorization = tf.keras.layers.TextVectorization(
    max_tokens=20000,
    output_mode="int",
    output_sequence_length=20,
    vocabulary=english_vocab)

# Create a dictionary mapping `int` indices to Portuguese words
portuguese_index_lookup = dict(zip(range(len(portuguese_vocab)), portuguese_vocab))
max_decoded_sentence_length = 20

def decode_sequence(input_sentence):
    """
    Decodes a sequence using a trained seq2seq RNN model.

    Args:
        input_sentence (str): the input sentence to be decoded

    Returns:
        decoded_sentence (str): the decoded sentence
            generated by the model
    """
    tokenized_input_sentence = source_vectorization([input_sentence])
    decoded_sentence = "[start]"

    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = target_vectorization([decoded_sentence])
        next_token_predictions = seq2seq_rnn.predict([tokenized_input_sentence, tokenized_target_sentence], verbose=0)
        # Greedy decoding: take the most probable token at position i
        sampled_token_index = np.argmax(next_token_predictions[0, i, :])
        sampled_token = portuguese_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token
        if sampled_token == "[end]":
            break
    return decoded_sentence

eng_sentences = ["What is its name?",
                "How old are you?",
                "I know you know where Mary is.",
                "We will show Tom.",
                "What do you all do?",
                "Don't do it!"]

for sentence in eng_sentences:
    print(f"English sentence:\n{sentence}")
    print(f'Portuguese translation:\n{decode_sequence(sentence)}')
    print('-' * 50)
```

This will output the following:
```
English sentence:
What is its name?
Portuguese translation:
[start] qual é o nome [end]
--------------------------------------------------
English sentence:
How old are you?
Portuguese translation:
[start] quantos anos você tem [end]
--------------------------------------------------
English sentence:
I know you know where Mary is.
Portuguese translation:
[start] eu sei que você sabe onde maria está [end]
--------------------------------------------------
English sentence:
We will show Tom.
Portuguese translation:
[start] nós vamos tom [end]
--------------------------------------------------
English sentence:
What do you all do?
Portuguese translation:
[start] o que vocês faz [end]
--------------------------------------------------
English sentence:
Don't do it!
Portuguese translation:
[start] não faça isso [end]
--------------------------------------------------
```
## Intended Use

This model was created for research purposes only. Specifically, it was designed to translate sentences from English to Portuguese. 
We do not recommend any application of this model outside this scope.

## Performance Metrics

Accuracy is a crude way to monitor validation-set performance for this task.
On average, the model predicts the words of the Portuguese sentence correctly about 65% of the time.
However, next-token accuracy is not a good metric for machine translation: during inference, the model generates the target sentence from scratch and cannot rely on ground-truth previous tokens, so high next-token accuracy does not imply a good translator.
Real-world machine translation systems are usually evaluated with _BLEU scores_ instead.
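
As a hedged illustration (the `sacrebleu` package and the example sentences are assumptions, not part of this repository), a corpus-level BLEU score could be computed like this:

```python
# Illustrative only: corpus-level BLEU with sacrebleu (pip install sacrebleu)
import sacrebleu

# Model outputs with the [start]/[end] markers stripped ...
hypotheses = ["qual é o nome", "quantos anos você tem"]
# ... and one stream of reference translations, aligned by index
references = [["qual é o seu nome?", "quantos anos você tem?"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```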

## Training Data

The model was trained on the [English-Portuguese translation](https://www.kaggle.com/datasets/nageshsingh/englishportuguese-translation) dataset from Kaggle, which consists of paired English and Portuguese sentences.


## Limitations

Translations are far from perfect. To improve this model, we could:

1. Use a deeper stack of recurrent layers in both the encoder and the decoder.
2. Use `LSTM` layers instead of `GRU` layers (see the sketch after this list).
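
Purely as an illustration (the exact layer layout is an assumption, not the released model), both changes might look like this:

```python
# Illustrative sketch: (1) a deeper, stacked recurrent encoder,
# and (2) LSTM layers in place of GRU layers
import tensorflow as tf
from tensorflow.keras import layers

embed_dim, latent_dim, vocab_size = 256, 1024, 20000

source = tf.keras.Input(shape=(None,), dtype="int64", name="english")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(source)
x = layers.LSTM(latent_dim, return_sequences=True)(x)  # extra stacked layer
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(x)

past_target = tf.keras.Input(shape=(None,), dtype="int64", name="portuguese")
y = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(past_target)
# An LSTM carries two states (h and c), both initialized from the encoder
y = layers.LSTM(latent_dim, return_sequences=True)(
    y, initial_state=[state_h, state_c])
next_token = layers.Dense(vocab_size, activation="softmax")(y)

deeper_seq2seq = tf.keras.Model([source, past_target], next_token)
```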

In conclusion, we do not recommend using this model in real-world applications. 
It was solely developed for academic and educational purposes.

## Cite as 🤗

```latex
@misc{teenytinycastle,
    doi = {10.5281/zenodo.7112065},
    url = {https://github.com/Nkluge-correa/teeny-tiny_castle},
    author = {Nicholas Kluge Corr{\^e}a},
    title = {Teeny-Tiny Castle},
    year = {2024},
    publisher = {GitHub},
    journal = {GitHub repository},
}
```

## License

The GRU-eng-por model is licensed under the Apache License, Version 2.0. See the LICENSE file for more details.