kambale committed
Commit 436be25 · verified · 1 Parent(s): b3bc53a

Update model v2: 50 epochs, Val Loss 1.803, Test BLEU 39.76, Params 23,363,200

README.md ADDED
@@ -0,0 +1,158 @@
+
+ ---
+ license: apache-2.0
+ language:
+ - en
+ - lg
+ tags:
+ - translation
+ - english
+ - luganda
+ - transformer
+ - pytorch
+ - seq2seq
+ - from-scratch
+ - low-resource-nlp
+ datasets:
+ - kambale/luganda-english-parallel-corpus
+ metrics:
+ - bleu
+ pipeline_tag: translation
+ model-index:
+ - name: pearl-23m-translate
+   results:
+   - task:
+       type: translation
+       name: Translation English to Luganda
+     dataset:
+       name: kambale/luganda-english-parallel-corpus (Test Split)
+       type: kambale/luganda-english-parallel-corpus
+       args: test
+     metrics:
+     - type: bleu
+       value: 39.76
+       name: BLEU
+     - type: loss
+       value: 1.803
+       name: Validation Loss (Best)
+ ---
+
+ # Pearl Model (23M-translate): English to Luganda Translation
+
+ This is the **Pearl Model (23M-translate-v2)**, a Transformer-based neural machine translation (NMT) model trained from scratch to translate text from English to Luganda, with approximately 23,363,200 parameters. Relative to the previous version, it has a longer maximum sequence length (512 tokens), a deeper encoder and decoder (6 layers each), wider feed-forward networks (PF_DIM 1024), a larger tokenizer vocabulary (16000), and uses the Transformer's original learning rate schedule.
+
+ ## Model Overview
+
+ The Pearl Model is an encoder-decoder Transformer architecture implemented entirely in PyTorch.
+
+ - **Model Type:** Sequence-to-Sequence Transformer
+ - **Source Language:** English (`english`)
+ - **Target Language:** Luganda (`luganda`)
+ - **Framework:** PyTorch
+ - **Parameters:** ~23,363,200
+ - **Training:** From scratch, with the "Attention is All You Need" learning rate schedule
+ - **Max Sequence Length:** 512 tokens
+ - **Tokenizer Vocabulary Size:** 16000 for both source and target
+
+ Detailed hyperparameters, architectural specifics, and tokenizer configurations can be found in the accompanying `config.json` file; a quick parameter-count check is sketched below.
+
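+ For a quick sanity check of the reported figure, the trainable-parameter count can be recomputed once the model has been instantiated as described under "How to Use" below (`model` refers to that `Seq2SeqTransformer` instance):
+
+ ```python
+ # Recompute the trainable-parameter count; it should match the reported 23,363,200.
+ num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
+ print(f"Trainable parameters: {num_params:,}")
+ ```
+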
+ ## Intended Use
+
+ This model is intended for:
+
+ * Translating general-domain text from English to Luganda, including longer sentences of up to 512 tokens.
+ * Research on low-resource machine translation, Transformer architectures, and NLP for African languages.
+ * Serving as a baseline for future improvements in English-Luganda translation.
+ * Educational use, for understanding how to build and train NMT models from scratch.
+
+ **Out-of-scope:**
+
+ * Translation of highly specialized or technical jargon not present in the training data.
+ * High-stakes applications requiring perfect fluency or nuance without further fine-tuning and rigorous evaluation.
+ * Translation into English (this model is unidirectional: English to Luganda).
+
+ ## Training Details
+
+ ### Dataset
+
+ The model was trained exclusively on the `kambale/luganda-english-parallel-corpus` dataset available on the Hugging Face Hub.
+
+ * **Dataset ID:** [kambale/luganda-english-parallel-corpus](https://huggingface.co/datasets/kambale/luganda-english-parallel-corpus)
+ * **Training Epochs Attempted:** 50 (early stopping based on validation loss was used)
+ * **Tokenizers:** Byte-Pair Encoding (BPE) tokenizers (vocab size: 16000) were trained from scratch; a training sketch follows this list.
+   * English Tokenizer: `english_tokenizer_v2.json`
+   * Luganda Tokenizer: `luganda_tokenizer_v2.json`
+
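+ The snippet below is a minimal sketch of how the corpus can be pulled from the Hub and how BPE tokenizers of this kind are typically trained with the Hugging Face `tokenizers` library. The column names and special-token strings are assumptions (inspect the dataset features and the tokenizer JSON files for the real ones); the vocabulary size and special-token ID order follow `config.json`.
+
+ ```python
+ from datasets import load_dataset
+ from tokenizers import Tokenizer, models, pre_tokenizers, trainers
+
+ # Pull the parallel corpus from the Hub.
+ dataset = load_dataset("kambale/luganda-english-parallel-corpus")
+
+ # Assumed column names; check dataset["train"].features for the actual ones.
+ SRC_COL, TRG_COL = "English", "Luganda"
+
+ def train_bpe_tokenizer(texts, vocab_size=16000):
+     """Train a from-scratch BPE tokenizer. Special-token order mirrors
+     config.json (pad=0, sos=1, eos=2, unk=3); the token strings are assumptions."""
+     tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
+     tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
+     trainer = trainers.BpeTrainer(
+         vocab_size=vocab_size,
+         special_tokens=["<pad>", "<sos>", "<eos>", "<unk>"],
+     )
+     tokenizer.train_from_iterator(texts, trainer=trainer)
+     return tokenizer
+
+ src_tokenizer = train_bpe_tokenizer(dataset["train"][SRC_COL])
+ trg_tokenizer = train_bpe_tokenizer(dataset["train"][TRG_COL])
+ src_tokenizer.save("english_tokenizer_v2.json")
+ trg_tokenizer.save("luganda_tokenizer_v2.json")
+ ```
+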
+ ### Compute Infrastructure
+
+ * **Hardware:** 1x NVIDIA A100 40GB (example)
+ * **Training Time:** ~3 hours for 35 epochs
+
+ ## Performance & Evaluation
+
+ * **Best Validation Loss:** 1.803
+ * **Test Set BLEU Score:** 39.76
+
+ *Example Validation Set Translations (from training run):*
+ *(Note: BLEU scores are calculated using SacreBLEU on detokenized text with `force=True`.)*
+
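+ For reference, the sketch below shows how a BLEU score of this kind can be computed with SacreBLEU on detokenized text. The hypothesis/reference strings here are placeholders; in practice the hypotheses would come from translating the test split with the model.
+
+ ```python
+ import sacrebleu
+
+ # Placeholder detokenized outputs and references for the test split.
+ hypotheses = ["Omusajja agenda mu katale."]
+ references = [["Omusajja agenda mu katale."]]  # one reference stream, aligned with hypotheses
+
+ bleu = sacrebleu.corpus_bleu(hypotheses, references, force=True)
+ print(f"BLEU: {bleu.score:.2f}")
+ ```
+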
+ ### Training & Learning Rate Curves
+
+ ![Training, Validation Loss, and Learning Rate Curve](loss_and_lr_curves_v2.png)
+
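+ The learning-rate curve above follows the schedule from "Attention Is All You Need", lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5), with warmup_steps = 4000 and Adam (betas 0.9/0.98, eps 1e-9) as recorded in `config.json`. The sketch below only illustrates that schedule with a PyTorch `LambdaLR`; it is not the exact training code, and `model` stands for the instantiated `Seq2SeqTransformer`.
+
+ ```python
+ import torch
+
+ HIDDEN_DIM = 256      # d_model, from config.json
+ WARMUP_STEPS = 4000   # from config.json
+
+ def noam_lambda(step: int) -> float:
+     """lr multiplier: d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
+     step = max(step, 1)  # avoid division by zero on the first call
+     return HIDDEN_DIM ** -0.5 * min(step ** -0.5, step * WARMUP_STEPS ** -1.5)
+
+ # Base lr of 1.0 so the lambda alone determines the learning rate.
+ optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
+ scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lambda)
+ # During training, call scheduler.step() after optimizer.step() once per batch.
+ ```
+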
+ ## How to Use
+
+ This model is provided with its PyTorch state dictionary, tokenizer files, and configuration. Manual loading is required.
+
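+ The individual files can first be fetched from the Hub with `huggingface_hub`; the repository id below is an assumption based on the model name and should be replaced with this repository's actual id.
+
+ ```python
+ from huggingface_hub import hf_hub_download
+
+ REPO_ID = "kambale/pearl-23m-translate"  # assumed repo id; replace with the actual one
+
+ config_path  = hf_hub_download(repo_id=REPO_ID, filename="config.json")
+ weights_path = hf_hub_download(repo_id=REPO_ID, filename="pytorch_model.bin")
+ src_tok_path = hf_hub_download(repo_id=REPO_ID, filename="english_tokenizer_v2.json")
+ trg_tok_path = hf_hub_download(repo_id=REPO_ID, filename="luganda_tokenizer_v2.json")
+ ```
+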
+ ### Manual Loading (Conceptual Example)
+
+ 1. **Define Model Architecture:** Use the Python classes (`Seq2SeqTransformer`, `Encoder`, `Decoder`, etc.) from the training script.
+ 2. **Load Tokenizers & Config:**
+ ```python
+ from tokenizers import Tokenizer
+ import json
+ import torch
+ # from your_model_script import Seq2SeqTransformer, Encoder, Decoder, PositionalEncoding, ...
+ # (these classes must be defined or importable before running this snippet)
+
+ # Load config
+ with open("config.json", 'r') as f:
+     config = json.load(f)
+
+ # Load tokenizers
+ src_tokenizer = Tokenizer.from_file(config["src_tokenizer_file"])
+ trg_tokenizer = Tokenizer.from_file(config["trg_tokenizer_file"])
+
+ # Model parameters from config
+ params = config["model_parameters"]
+ special_tokens = config["special_token_ids"]
+ DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+
+ # Rebuild the architecture with the exact hyperparameters used in training.
+ enc = Encoder(params["input_dim_vocab_size_src"], params["hidden_dim"], params["encoder_layers"],
+               params["encoder_heads"], params["encoder_pf_dim"], params["encoder_dropout"],
+               DEVICE, params["max_seq_length"])
+ dec = Decoder(params["output_dim_vocab_size_trg"], params["hidden_dim"], params["decoder_layers"],
+               params["decoder_heads"], params["decoder_pf_dim"], params["decoder_dropout"],
+               DEVICE, params["max_seq_length"])
+
+ # The same pad token id (0 in config.json) is passed for both source and target.
+ model = Seq2SeqTransformer(enc, dec, special_tokens["pad_token_id"], special_tokens["pad_token_id"], DEVICE)
+
+ # Load the trained weights and switch to inference mode.
+ model.load_state_dict(torch.load(config["pytorch_model_path"], map_location=DEVICE))
+ model.to(DEVICE)
+ model.eval()
+ ```
+ 3. **Inference:** Use a `translate_sentence` function similar to the one in the training notebook; a greedy-decoding sketch follows below.
+
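+ A minimal greedy-decoding sketch of such a `translate_sentence` function is shown below. It assumes the from-scratch model exposes `encoder`, `decoder`, `make_src_mask`, and `make_trg_mask` (a common layout for this kind of implementation), that the decoder returns `(logits, attention)`, and that sequences are wrapped in the `<sos>`/`<eos>` IDs from `config.json` (1 and 2); adjust the calls to match the actual class definitions in the training script.
+
+ ```python
+ import torch
+
+ def translate_sentence(sentence, model, src_tokenizer, trg_tokenizer, device,
+                        max_len=512, sos_id=1, eos_id=2):
+     """Greedily translate one English sentence into Luganda."""
+     model.eval()
+     # Encode the source and add <sos>/<eos>, mirroring the assumed training preprocessing.
+     src_ids = [sos_id] + src_tokenizer.encode(sentence).ids + [eos_id]
+     src_tensor = torch.LongTensor(src_ids).unsqueeze(0).to(device)
+
+     with torch.no_grad():
+         src_mask = model.make_src_mask(src_tensor)
+         enc_src = model.encoder(src_tensor, src_mask)
+
+         trg_ids = [sos_id]
+         for _ in range(max_len):
+             trg_tensor = torch.LongTensor(trg_ids).unsqueeze(0).to(device)
+             trg_mask = model.make_trg_mask(trg_tensor)
+             logits, _ = model.decoder(trg_tensor, enc_src, trg_mask, src_mask)
+             next_id = logits[:, -1, :].argmax(dim=-1).item()  # greedy choice
+             trg_ids.append(next_id)
+             if next_id == eos_id:
+                 break
+
+     # Drop <sos>/<eos> before detokenizing.
+     return trg_tokenizer.decode([i for i in trg_ids[1:] if i != eos_id])
+
+ # Example (after the loading snippet above):
+ # print(translate_sentence("How are you today?", model, src_tokenizer, trg_tokenizer, DEVICE))
+ ```
+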
+ ## Limitations and Bias
+
+ * **Low-Resource Pair & Data Size:** While performance has improved, Luganda remains a low-resource language; the model may struggle with out-of-vocabulary words or highly nuanced text.
+ * **Data Source Bias:** Biases present in `kambale/luganda-english-parallel-corpus` will be reflected in the model's output.
+ * **Generalization:** The model may not generalize well to domains very different from the training data.
+
+ ## Ethical Considerations
+ (Content similar to original)
+
+ ## Future Work
+ Planned directions include beam search decoding and training with larger datasets and models.
+
+ ## Disclaimer
+ (Content similar to original)
config.json ADDED
@@ -0,0 +1,44 @@
+ {
+   "model_type": "seq2seq_transformer_from_scratch",
+   "architecture": "Encoder-Decoder Transformer",
+   "framework": "pytorch",
+   "source_language": "english",
+   "target_language": "luganda",
+   "pytorch_model_path": "pytorch_model.bin",
+   "src_tokenizer_file": "english_tokenizer_v2.json",
+   "trg_tokenizer_file": "luganda_tokenizer_v2.json",
+   "model_parameters": {
+     "input_dim_vocab_size_src": 16000,
+     "output_dim_vocab_size_trg": 16000,
+     "hidden_dim": 256,
+     "encoder_layers": 6,
+     "decoder_layers": 6,
+     "encoder_heads": 8,
+     "decoder_heads": 8,
+     "encoder_pf_dim": 1024,
+     "decoder_pf_dim": 1024,
+     "encoder_dropout": 0.1,
+     "decoder_dropout": 0.1,
+     "max_seq_length": 512
+   },
+   "special_token_ids": {
+     "pad_token_id": 0,
+     "sos_token_id": 1,
+     "eos_token_id": 2,
+     "unk_token_id": 3
+   },
+   "tokenizer_vocab_size": 16000,
+   "dataset_used_for_training": "kambale/luganda-english-parallel-corpus",
+   "training_epochs_run": 50,
+   "batch_size": 64,
+   "learning_rate_schedule": "Transformer original (Attention is All You Need)",
+   "warmup_steps": 4000,
+   "optimizer_betas": [
+     0.9,
+     0.98
+   ],
+   "optimizer_eps": 1e-09,
+   "best_validation_loss": 1.8031821854506866,
+   "bleu_on_test_set": 39.76353643835254,
+   "num_trainable_parameters": 23363200
+ }
english_tokenizer_v2.json ADDED
The diff for this file is too large to render. See raw diff
 
loss_and_lr_curves_v2.png ADDED
luganda_tokenizer_v2.json ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:948b496b77008b499c2424d4530ddc277d591bb309bdf800f68e018c5ebd81a7
+ size 94605786