---
language:
- en
- pt
library_name: tf-keras
license: apache-2.0
tags:
- translation
---

# GRU-eng-por

## Model Overview

The GRU-eng-por model is a sequence-to-sequence recurrent neural network (RNN) for English-to-Portuguese machine translation.

### Details

- **Size:** 42,554,912 parameters
- **Model type:** Recurrent neural network
- **Optimizer:** `rmsprop`
- **Number of epochs:** 15
- **Embedding dimensionality:** 256
- **Feed-forward network dimensionality:** 1024
- **Hardware:** Tesla T4
- **Emissions:** Not measured
- **Total energy consumption:** Not measured
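
For reference, here is a minimal sketch of a GRU encoder-decoder consistent with the hyperparameters above, following the classic Keras sequence-to-sequence recipe (a bidirectional GRU encoder feeding a GRU decoder). The exact topology of the published model is an assumption here:

```python
import tensorflow as tf

embed_dim = 256     # embedding dimensionality (from the details above)
latent_dim = 1024   # recurrent/feed-forward dimensionality (from the details above)
vocab_size = 20000  # matches the vectorizers' `max_tokens` in the usage example below

# Encoder: embed the English tokens and summarize them into a single state
source = tf.keras.Input(shape=(None,), dtype="int64", name="english")
x = tf.keras.layers.Embedding(vocab_size, embed_dim, mask_zero=True)(source)
encoded_source = tf.keras.layers.Bidirectional(
    tf.keras.layers.GRU(latent_dim), merge_mode="sum")(x)

# Decoder: embed the Portuguese tokens seen so far and predict the next token,
# conditioned on the encoder state
past_target = tf.keras.Input(shape=(None,), dtype="int64", name="portuguese")
y = tf.keras.layers.Embedding(vocab_size, embed_dim, mask_zero=True)(past_target)
y = tf.keras.layers.GRU(latent_dim, return_sequences=True)(y, initial_state=encoded_source)
y = tf.keras.layers.Dropout(0.5)(y)
next_token_probs = tf.keras.layers.Dense(vocab_size, activation="softmax")(y)

seq2seq_rnn = tf.keras.Model([source, past_target], next_token_probs)
seq2seq_rnn.compile(optimizer="rmsprop",
                    loss="sparse_categorical_crossentropy",
                    metrics=["accuracy"])
```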

### How to Use

```python
!pip install "huggingface_hub[tensorflow]" -q

from huggingface_hub import from_pretrained_keras
from huggingface_hub import hf_hub_download
import tensorflow as tf
import numpy as np
import string
import re

# Select characters to strip, but preserve the "[" and "]"
strip_chars = string.punctuation
strip_chars = strip_chars.replace("[", "")
strip_chars = strip_chars.replace("]", "")

def custom_standardization(input_string):
    # Lowercase the input and remove punctuation (keeping "[" and "]",
    # so the [start]/[end] markers survive standardization)
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(lowercase, f"[{re.escape(strip_chars)}]", "")

# Load the `seq2seq_rnn` from the Hub
seq2seq_rnn = from_pretrained_keras("AiresPucrs/GRU-eng-por")

# Download the Portuguese vocabulary
portuguese_vocabulary_path = hf_hub_download(
    repo_id="AiresPucrs/GRU-eng-por",
    filename="portuguese_vocabulary.txt",
    repo_type='model',
    local_dir="./")

# Download the English vocabulary
english_vocabulary_path = hf_hub_download(
    repo_id="AiresPucrs/GRU-eng-por",
    filename="english_vocabulary.txt",
    repo_type='model',
    local_dir="./")

# Read the vocabularies (the `with` blocks close the files automatically)
with open(portuguese_vocabulary_path, encoding='utf-8', errors='backslashreplace') as fp:
    portuguese_vocab = [line.strip() for line in fp]

with open(english_vocabulary_path, encoding='utf-8', errors='backslashreplace') as fp:
    english_vocab = [line.strip() for line in fp]

# Initialize the vectorizers with the learned vocabularies
target_vectorization = tf.keras.layers.TextVectorization(max_tokens=20000,
                                                         output_mode="int",
                                                         output_sequence_length=21,
                                                         standardize=custom_standardization,
                                                         vocabulary=portuguese_vocab)

source_vectorization = tf.keras.layers.TextVectorization(max_tokens=20000,
                                                         output_mode="int",
                                                         output_sequence_length=20,
                                                         vocabulary=english_vocab)

# Create a dictionary mapping token indices (`int`) to Portuguese words
portuguese_index_lookup = dict(zip(range(len(portuguese_vocab)), portuguese_vocab))
max_decoded_sentence_length = 20

def decode_sequence(input_sentence):
    """
    Decodes a sequence using a trained seq2seq RNN model.

    Args:
        input_sentence (str): the input sentence to be decoded

    Returns:
        decoded_sentence (str): the decoded sentence
            generated by the model
    """
    tokenized_input_sentence = source_vectorization([input_sentence])
    decoded_sentence = "[start]"

    # Greedy decoding: at each step, feed the tokens generated so far
    # and keep the single most likely next token
    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = target_vectorization([decoded_sentence])
        next_token_predictions = seq2seq_rnn.predict(
            [tokenized_input_sentence, tokenized_target_sentence], verbose=0)
        sampled_token_index = np.argmax(next_token_predictions[0, i, :])
        sampled_token = portuguese_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token
        if sampled_token == "[end]":
            break
    return decoded_sentence

eng_sentences = ["What is its name?",
                 "How old are you?",
                 "I know you know where Mary is.",
                 "We will show Tom.",
                 "What do you all do?",
                 "Don't do it!"]

for sentence in eng_sentences:
    print(f"English sentence:\n{sentence}")
    print(f"Portuguese translation:\n{decode_sequence(sentence)}")
    print('-' * 50)
```

This will output the following:

```
English sentence:
What is its name?
Portuguese translation:
[start] qual é o nome [end]
--------------------------------------------------
English sentence:
How old are you?
Portuguese translation:
[start] quantos anos você tem [end]
--------------------------------------------------
English sentence:
I know you know where Mary is.
Portuguese translation:
[start] eu sei que você sabe onde maria está [end]
--------------------------------------------------
English sentence:
We will show Tom.
Portuguese translation:
[start] nós vamos tom [end]
--------------------------------------------------
English sentence:
What do you all do?
Portuguese translation:
[start] o que vocês faz [end]
--------------------------------------------------
English sentence:
Don't do it!
Portuguese translation:
[start] não faça isso [end]
--------------------------------------------------
```
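
Note that `decode_sequence` keeps the decoder's `[start]`/`[end]` markers in its output. If you want the bare translation, a small post-processing helper (hypothetical, not part of this repository) can strip them:

```python
def clean_translation(decoded_sentence: str) -> str:
    # Remove the [start]/[end] markers produced by decode_sequence
    return decoded_sentence.replace("[start]", "").replace("[end]", "").strip()

print(clean_translation("[start] não faça isso [end]"))  # não faça isso
```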

## Intended Use

This model was created for research purposes only. Specifically, it was designed to translate sentences from English to Portuguese. We do not recommend any application of this model outside this scope.

## Performance Metrics

Accuracy is a crude way to monitor validation-set performance on this task: on average, the model predicts 65% of the words in the Portuguese target sentences correctly.

However, next-token accuracy is not a great metric for machine translation models. It is measured with teacher forcing, where the model conditions on the ground-truth previous tokens, while during inference the model generates the target sentence from scratch and has to rely on its own (possibly wrong) previous predictions. In other words, high next-token accuracy does not imply a good translator. Real-world machine translation systems would more likely be evaluated with _BLEU_ scores.
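
For illustration, here is a minimal sketch of a BLEU evaluation with the `sacrebleu` library (an assumption; this repository does not ship an evaluation script, and the sentence pairs below are placeholders):

```python
# pip install sacrebleu -q
import sacrebleu

# Placeholder hypotheses (model outputs with [start]/[end] stripped) and references
hypotheses = ["não faça isso", "quantos anos você tem"]
references = [["não faça isso!", "quantos anos você tem?"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```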

## Training Data

[English-portuguese translation](https://www.kaggle.com/datasets/nageshsingh/englishportuguese-translation).

The dataset consists of pairs of English sentences and their Portuguese translations.
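
The decoder-side sentences need the `[start]`/`[end]` markers that `custom_standardization` deliberately preserves. Here is a sketch of how a raw pair would be annotated (the exact training pipeline is an assumption):

```python
# Hypothetical preparation of one English-Portuguese training pair:
# the Portuguese side is wrapped in the [start]/[end] markers
def make_pair(english, portuguese):
    return english, f"[start] {portuguese} [end]"

print(make_pair("Don't do it!", "Não faça isso!"))
# ("Don't do it!", '[start] Não faça isso! [end]')
```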

## Limitations

Translations are far from perfect. To improve this model, we could:

1. Use a deeper stack of recurrent layers in both the encoder and the decoder.
2. Use an `LSTM` instead of a `GRU` (see the sketch after this list).
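
As an illustration of both ideas, here is a hedged sketch of a deeper, LSTM-based encoder reusing the hyperparameters above (not the published architecture):

```python
import tensorflow as tf

embed_dim, latent_dim, vocab_size = 256, 1024, 20000

# Two stacked recurrent layers, with LSTM cells in place of GRU cells
source = tf.keras.Input(shape=(None,), dtype="int64", name="english")
x = tf.keras.layers.Embedding(vocab_size, embed_dim, mask_zero=True)(source)
x = tf.keras.layers.LSTM(latent_dim, return_sequences=True)(x)  # extra stacked layer
x, state_h, state_c = tf.keras.layers.LSTM(latent_dim, return_state=True)(x)
# state_h/state_c would seed the decoder's initial state, as in the GRU version
```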

In conclusion, we do not recommend using this model in real-world applications. It was developed solely for academic and educational purposes.

## Cite as 🤗

```latex
@misc{teenytinycastle,
  doi = {10.5281/zenodo.7112065},
  url = {https://github.com/Nkluge-correa/teeny-tiny_castle},
  author = {Nicholas Kluge Corr{\^e}a},
  title = {Teeny-Tiny Castle},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
}
```

## License

GRU-eng-por is licensed under the Apache License, Version 2.0. See the LICENSE file for more details.