kambale committed
Commit 436be25 · verified · 1 Parent(s): b3bc53a

Update model v2: 50 epochs, Val Loss 1.803, Test BLEU 39.76, Params 23,363,200

README.md ADDED
@@ -0,0 +1,158 @@
+
+ ---
+ license: apache-2.0
+ language:
+ - en
+ - lg
+ tags:
+ - translation
+ - english
+ - luganda
+ - transformer
+ - pytorch
+ - seq2seq
+ - from-scratch
+ - low-resource-nlp
+ datasets:
+ - kambale/luganda-english-parallel-corpus
+ metrics:
+ - bleu
+ pipeline_tag: translation
+ model-index:
+ - name: pearl-23m-translate
+   results:
+   - task:
+       type: translation
+       name: Translation English to Luganda
+     dataset:
+       name: kambale/luganda-english-parallel-corpus (Test Split)
+       type: kambale/luganda-english-parallel-corpus
+       args: test
+     metrics:
+     - type: bleu
+       value: 39.76
+       name: BLEU
+     - type: loss
+       value: 1.803
+       name: Validation Loss (Best)
+ ---
+
+ # Pearl Model (23M-translate): English to Luganda Translation
+
+ This is the **Pearl Model (23M-translate-v2)**, a Transformer-based neural machine translation (NMT) model trained from scratch to translate text from English to Luganda, with approximately 23,363,200 parameters. Relative to the previous version, it has a longer maximum sequence length (512 tokens), a deeper encoder and decoder (6 layers each), wider feed-forward networks (PF_DIM 1024), a larger tokenizer vocabulary (16000), and uses the Transformer's original learning rate schedule.
+
+ ## Model Overview
+
+ The Pearl Model is an encoder-decoder Transformer architecture implemented entirely in PyTorch.
+
+ - **Model Type:** Sequence-to-Sequence Transformer
+ - **Source Language:** English (`english`)
+ - **Target Language:** Luganda (`luganda`)
+ - **Framework:** PyTorch
+ - **Parameters:** ~23,363,200
+ - **Training:** From scratch, with the "Attention is All You Need" learning rate schedule
+ - **Max Sequence Length:** 512 tokens
+ - **Tokenizer Vocabulary Size:** 16000 for both source and target
+
+ Detailed hyperparameters, architectural specifics, and tokenizer configurations can be found in the accompanying `config.json` file; a quick parameter-count check is sketched below.
+
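+ For a quick sanity check of the reported figure, the trainable-parameter count can be recomputed once the model has been instantiated as described under "How to Use" below (`model` refers to that `Seq2SeqTransformer` instance):
+
+ ```python
+ # Recompute the trainable-parameter count; it should match the reported 23,363,200.
+ num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
+ print(f"Trainable parameters: {num_params:,}")
+ ```
+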
+ ## Intended Use
+
+ This model is intended for:
+
+ * Translating general-domain text from English to Luganda, including longer sentences of up to 512 tokens.
+ * Research on low-resource machine translation, Transformer architectures, and NLP for African languages.
+ * Serving as a baseline for future improvements in English-Luganda translation.
+ * Educational use, for understanding how to build and train NMT models from scratch.
+
+ **Out-of-scope:**
+
+ * Translation of highly specialized or technical jargon not present in the training data.
+ * High-stakes applications requiring perfect fluency or nuance without further fine-tuning and rigorous evaluation.
+ * Translation into English (this model is unidirectional: English to Luganda).
+
+ ## Training Details
+
+ ### Dataset
+
+ The model was trained exclusively on the `kambale/luganda-english-parallel-corpus` dataset available on the Hugging Face Hub.
+
+ * **Dataset ID:** [kambale/luganda-english-parallel-corpus](https://huggingface.co/datasets/kambale/luganda-english-parallel-corpus)
+ * **Training Epochs Attempted:** 50 (early stopping based on validation loss was used)
+ * **Tokenizers:** Byte-Pair Encoding (BPE) tokenizers (vocab size: 16000) were trained from scratch; a training sketch follows this list.
+   * English Tokenizer: `english_tokenizer_v2.json`
+   * Luganda Tokenizer: `luganda_tokenizer_v2.json`
+
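+ The snippet below is a minimal sketch of how the corpus can be pulled from the Hub and how BPE tokenizers of this kind are typically trained with the Hugging Face `tokenizers` library. The column names and special-token strings are assumptions (inspect the dataset features and the tokenizer JSON files for the real ones); the vocabulary size and special-token ID order follow `config.json`.
+
+ ```python
+ from datasets import load_dataset
+ from tokenizers import Tokenizer, models, pre_tokenizers, trainers
+
+ # Pull the parallel corpus from the Hub.
+ dataset = load_dataset("kambale/luganda-english-parallel-corpus")
+
+ # Assumed column names; check dataset["train"].features for the actual ones.
+ SRC_COL, TRG_COL = "English", "Luganda"
+
+ def train_bpe_tokenizer(texts, vocab_size=16000):
+     """Train a from-scratch BPE tokenizer. Special-token order mirrors
+     config.json (pad=0, sos=1, eos=2, unk=3); the token strings are assumptions."""
+     tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
+     tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
+     trainer = trainers.BpeTrainer(
+         vocab_size=vocab_size,
+         special_tokens=["<pad>", "<sos>", "<eos>", "<unk>"],
+     )
+     tokenizer.train_from_iterator(texts, trainer=trainer)
+     return tokenizer
+
+ src_tokenizer = train_bpe_tokenizer(dataset["train"][SRC_COL])
+ trg_tokenizer = train_bpe_tokenizer(dataset["train"][TRG_COL])
+ src_tokenizer.save("english_tokenizer_v2.json")
+ trg_tokenizer.save("luganda_tokenizer_v2.json")
+ ```
+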
+ ### Compute Infrastructure
+
+ * **Hardware:** 1x NVIDIA A100 40GB (example)
+ * **Training Time:** ~3 hours for 35 epochs
+
+ ## Performance & Evaluation
+
+ * **Best Validation Loss:** 1.803
+ * **Test Set BLEU Score:** 39.76
+
+ *Example Validation Set Translations (from training run):*
+ *(Note: BLEU scores are calculated using SacreBLEU on detokenized text with `force=True`.)*
+
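+ For reference, the sketch below shows how a BLEU score of this kind can be computed with SacreBLEU on detokenized text. The hypothesis/reference strings here are placeholders; in practice the hypotheses would come from translating the test split with the model.
+
+ ```python
+ import sacrebleu
+
+ # Placeholder detokenized outputs and references for the test split.
+ hypotheses = ["Omusajja agenda mu katale."]
+ references = [["Omusajja agenda mu katale."]]  # one reference stream, aligned with hypotheses
+
+ bleu = sacrebleu.corpus_bleu(hypotheses, references, force=True)
+ print(f"BLEU: {bleu.score:.2f}")
+ ```
+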
+ ### Training & Learning Rate Curves
+
+ ![Training, Validation Loss, and Learning Rate Curve](loss_and_lr_curves_v2.png)
+
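+ The learning-rate curve above follows the schedule from "Attention Is All You Need", lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5), with warmup_steps = 4000 and Adam (betas 0.9/0.98, eps 1e-9) as recorded in `config.json`. The sketch below only illustrates that schedule with a PyTorch `LambdaLR`; it is not the exact training code, and `model` stands for the instantiated `Seq2SeqTransformer`.
+
+ ```python
+ import torch
+
+ HIDDEN_DIM = 256      # d_model, from config.json
+ WARMUP_STEPS = 4000   # from config.json
+
+ def noam_lambda(step: int) -> float:
+     """lr multiplier: d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
+     step = max(step, 1)  # avoid division by zero on the first call
+     return HIDDEN_DIM ** -0.5 * min(step ** -0.5, step * WARMUP_STEPS ** -1.5)
+
+ # Base lr of 1.0 so the lambda alone determines the learning rate.
+ optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
+ scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lambda)
+ # During training, call scheduler.step() after optimizer.step() once per batch.
+ ```
+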
+ ## How to Use
+
+ This model is provided with its PyTorch state dictionary, tokenizer files, and configuration. Manual loading is required.
+
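+ The individual files can first be fetched from the Hub with `huggingface_hub`; the repository id below is an assumption based on the model name and should be replaced with this repository's actual id.
+
+ ```python
+ from huggingface_hub import hf_hub_download
+
+ REPO_ID = "kambale/pearl-23m-translate"  # assumed repo id; replace with the actual one
+
+ config_path  = hf_hub_download(repo_id=REPO_ID, filename="config.json")
+ weights_path = hf_hub_download(repo_id=REPO_ID, filename="pytorch_model.bin")
+ src_tok_path = hf_hub_download(repo_id=REPO_ID, filename="english_tokenizer_v2.json")
+ trg_tok_path = hf_hub_download(repo_id=REPO_ID, filename="luganda_tokenizer_v2.json")
+ ```
+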
+ ### Manual Loading (Conceptual Example)
+
+ 1. **Define Model Architecture:** Use the Python classes (`Seq2SeqTransformer`, `Encoder`, `Decoder`, etc.) from the training script.
+ 2. **Load Tokenizers & Config:**
+ ```python
+ from tokenizers import Tokenizer
+ import json
+ import torch
+ # from your_model_script import Seq2SeqTransformer, Encoder, Decoder, PositionalEncoding, ...
+ # (these classes must be defined or importable before running this snippet)
+
+ # Load config
+ with open("config.json", 'r') as f:
+     config = json.load(f)
+
+ # Load tokenizers
+ src_tokenizer = Tokenizer.from_file(config["src_tokenizer_file"])
+ trg_tokenizer = Tokenizer.from_file(config["trg_tokenizer_file"])
+
+ # Model parameters from config
+ params = config["model_parameters"]
+ special_tokens = config["special_token_ids"]
+ DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+
+ # Rebuild the architecture with the exact hyperparameters used in training.
+ enc = Encoder(params["input_dim_vocab_size_src"], params["hidden_dim"], params["encoder_layers"],
+               params["encoder_heads"], params["encoder_pf_dim"], params["encoder_dropout"],
+               DEVICE, params["max_seq_length"])
+ dec = Decoder(params["output_dim_vocab_size_trg"], params["hidden_dim"], params["decoder_layers"],
+               params["decoder_heads"], params["decoder_pf_dim"], params["decoder_dropout"],
+               DEVICE, params["max_seq_length"])
+
+ # The same pad token id (0 in config.json) is passed for both source and target.
+ model = Seq2SeqTransformer(enc, dec, special_tokens["pad_token_id"], special_tokens["pad_token_id"], DEVICE)
+
+ # Load the trained weights and switch to inference mode.
+ model.load_state_dict(torch.load(config["pytorch_model_path"], map_location=DEVICE))
+ model.to(DEVICE)
+ model.eval()
+ ```
+ 3. **Inference:** Use a `translate_sentence` function similar to the one in the training notebook; a greedy-decoding sketch follows below.
+
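+ A minimal greedy-decoding sketch of such a `translate_sentence` function is shown below. It assumes the from-scratch model exposes `encoder`, `decoder`, `make_src_mask`, and `make_trg_mask` (a common layout for this kind of implementation), that the decoder returns `(logits, attention)`, and that sequences are wrapped in the `<sos>`/`<eos>` IDs from `config.json` (1 and 2); adjust the calls to match the actual class definitions in the training script.
+
+ ```python
+ import torch
+
+ def translate_sentence(sentence, model, src_tokenizer, trg_tokenizer, device,
+                        max_len=512, sos_id=1, eos_id=2):
+     """Greedily translate one English sentence into Luganda."""
+     model.eval()
+     # Encode the source and add <sos>/<eos>, mirroring the assumed training preprocessing.
+     src_ids = [sos_id] + src_tokenizer.encode(sentence).ids + [eos_id]
+     src_tensor = torch.LongTensor(src_ids).unsqueeze(0).to(device)
+
+     with torch.no_grad():
+         src_mask = model.make_src_mask(src_tensor)
+         enc_src = model.encoder(src_tensor, src_mask)
+
+         trg_ids = [sos_id]
+         for _ in range(max_len):
+             trg_tensor = torch.LongTensor(trg_ids).unsqueeze(0).to(device)
+             trg_mask = model.make_trg_mask(trg_tensor)
+             logits, _ = model.decoder(trg_tensor, enc_src, trg_mask, src_mask)
+             next_id = logits[:, -1, :].argmax(dim=-1).item()  # greedy choice
+             trg_ids.append(next_id)
+             if next_id == eos_id:
+                 break
+
+     # Drop <sos>/<eos> before detokenizing.
+     return trg_tokenizer.decode([i for i in trg_ids[1:] if i != eos_id])
+
+ # Example (after the loading snippet above):
+ # print(translate_sentence("How are you today?", model, src_tokenizer, trg_tokenizer, DEVICE))
+ ```
+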
+ ## Limitations and Bias
+
+ * **Low-Resource Pair & Data Size:** While performance has improved, Luganda remains a low-resource language; the model may struggle with out-of-vocabulary words or highly nuanced text.
+ * **Data Source Bias:** Biases present in `kambale/luganda-english-parallel-corpus` will be reflected in the model's output.
+ * **Generalization:** The model may not generalize well to domains very different from the training data.
+
+ ## Ethical Considerations
+ (Content similar to original)
+
+ ## Future Work
+ Planned directions include beam search decoding and training with larger datasets and models.
+
+ ## Disclaimer
+ (Content similar to original)
config.json ADDED
@@ -0,0 +1,44 @@
+ {
+   "model_type": "seq2seq_transformer_from_scratch",
+   "architecture": "Encoder-Decoder Transformer",
+   "framework": "pytorch",
+   "source_language": "english",
+   "target_language": "luganda",
+   "pytorch_model_path": "pytorch_model.bin",
+   "src_tokenizer_file": "english_tokenizer_v2.json",
+   "trg_tokenizer_file": "luganda_tokenizer_v2.json",
+   "model_parameters": {
+     "input_dim_vocab_size_src": 16000,
+     "output_dim_vocab_size_trg": 16000,
+     "hidden_dim": 256,
+     "encoder_layers": 6,
+     "decoder_layers": 6,
+     "encoder_heads": 8,
+     "decoder_heads": 8,
+     "encoder_pf_dim": 1024,
+     "decoder_pf_dim": 1024,
+     "encoder_dropout": 0.1,
+     "decoder_dropout": 0.1,
+     "max_seq_length": 512
+   },
+   "special_token_ids": {
+     "pad_token_id": 0,
+     "sos_token_id": 1,
+     "eos_token_id": 2,
+     "unk_token_id": 3
+   },
+   "tokenizer_vocab_size": 16000,
+   "dataset_used_for_training": "kambale/luganda-english-parallel-corpus",
+   "training_epochs_run": 50,
+   "batch_size": 64,
+   "learning_rate_schedule": "Transformer original (Attention is All You Need)",
+   "warmup_steps": 4000,
+   "optimizer_betas": [
+     0.9,
+     0.98
+   ],
+   "optimizer_eps": 1e-09,
+   "best_validation_loss": 1.8031821854506866,
+   "bleu_on_test_set": 39.76353643835254,
+   "num_trainable_parameters": 23363200
+ }
english_tokenizer_v2.json ADDED
The diff for this file is too large to render. See raw diff
 
loss_and_lr_curves_v2.png ADDED
luganda_tokenizer_v2.json ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:948b496b77008b499c2424d4530ddc277d591bb309bdf800f68e018c5ebd81a7
+ size 94605786