Pocket Polyglot Mzansi 50M (6 languages)

Pocket Polyglot Mzansi is a small 50M-parameter machine translation model for South African languages. The model is part of an ongoing research project that aims to develop a small (<50M parameters) machine translation model that matches or exceeds the accuracy of NLLB-200-600M on South African languages. The current version of the model is more than 90% smaller than NLLB-200-600M, but sacrifices only 6.3% in accuracy in terms of chrF++.

Model Details

Model Description

  • Developed by: Stefan Strydom
  • Model type: Small 50M-parameter translation model for six South African languages, built using the architecture from NLLB-200.
  • Language(s) (NLP): Afrikaans (afr_Latn), English (eng_Latn), isiXhosa (xho_Latn), isiZulu (zul_Latn), Setswana (tsn_Latn), Sepedi (nso_Latn)
  • License: CC BY-NC 4.0.

Intended use

Pocket Polyglot Mzansi is a research model. The intended use is deployment on edge devices for offline machine translation; it supports single-sentence translation among the six languages listed above.

How to Get Started with the Model

Use the code below to get started with the model.

>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

>>> tokenizer = AutoTokenizer.from_pretrained("stefan7/pocket_polyglot_mzansi_50M_6langs")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("stefan7/pocket_polyglot_mzansi_50M_6langs")

>>> tokenizer.src_lang = "eng_Latn"  # source language code
>>> text = "How was your day?"
>>> inputs = tokenizer(text, return_tensors="pt")

>>> # force the decoder to generate in the target language (isiXhosa)
>>> translated_tokens = model.generate(
...     **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("xho_Latn")
... )
>>> tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
'Wawunjani umhla wakho?'
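
Since the intended use is offline translation on edge devices, the checkpoint can also be loaded in 16-bit precision to roughly halve its memory footprint (about 0.09 GB, see the results table below). The following is a minimal sketch using standard transformers arguments, not a recipe specific to this model; half-precision inference assumes hardware with fp16 support and may be slow or unsupported for some operations on CPU.

>>> import torch
>>> model = AutoModelForSeq2SeqLM.from_pretrained(
...     "stefan7/pocket_polyglot_mzansi_50M_6langs", torch_dtype=torch.float16
... )
>>> footprint_gib = model.get_memory_footprint() / 2**30  # roughly 0.09 GiB for this 50M model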

Training Details

Training Data

The model was trained on data from WMT22-African.

Training Procedure

  • Batch size of 128 sentences
  • Trained on 45M sentences (351,000 update steps)
  • 1cycle policy scheduler with:
    • two phases
    • max_lr = 1e-3
    • pct_start = 0.25
    • anneal_strategy = 'cos'
    • div_factor = 25.0
    • final_div_factor = 1e5
  • Adam optimizer with mom=0.9 and sqr_mom=0.98 (i.e. Adam betas of 0.9 and 0.98), eps=1e-6 (see the configuration sketch after this list)
  • No dropout or weight decay (not considered/tuned yet for this work)
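
As a rough illustration, the optimizer and schedule listed above can be expressed in plain PyTorch roughly as follows. This is a minimal sketch rather than the actual training script: the pretrained checkpoint stands in for a freshly initialized model of the same architecture, the data loader is a hypothetical placeholder yielding tokenized batches with labels, and fastai's mom/sqr_mom are mapped to Adam's betas.

import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import OneCycleLR
from transformers import AutoModelForSeq2SeqLM

# Stand-in for a freshly initialized NLLB-style model of the same size.
model = AutoModelForSeq2SeqLM.from_pretrained("stefan7/pocket_polyglot_mzansi_50M_6langs")

optimizer = Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.98), eps=1e-6, weight_decay=0.0)
scheduler = OneCycleLR(
    optimizer,
    max_lr=1e-3,
    total_steps=351_000,       # ~45M sentences at a batch size of 128
    pct_start=0.25,            # fraction of steps spent ramping up the learning rate
    anneal_strategy="cos",
    div_factor=25.0,           # initial_lr = max_lr / div_factor
    final_div_factor=1e5,      # min_lr = initial_lr / final_div_factor
    cycle_momentum=False,      # keep betas fixed at the values above
)

# train_dataloader is a hypothetical placeholder that yields dicts with
# input_ids, attention_mask and labels tensors; in practice the loop runs
# until the 351k update steps are reached.
for batch in train_dataloader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()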

Evaluation

Testing Data

Tested on the Flores200 devtest split.

Metrics

Following the approach used by the NLLB-200 project, the model was evaluated using spBLEU and chrF++, metrics widely adopted by the machine translation community.
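
Both metrics can be computed with the sacrebleu library. The snippet below is a minimal sketch, not the exact evaluation script: the hypothesis and reference strings are placeholders (in practice the references come from the Flores200 devtest split), and the "flores200" tokenizer used for spBLEU assumes a recent sacrebleu release with SentencePiece installed.

from sacrebleu.metrics import BLEU, CHRF

hypotheses = ["Wawunjani umhla wakho?"]    # model translations (placeholder)
references = [["Wawunjani umhla wakho?"]]  # one reference stream (placeholder; use Flores200 devtest)

chrf_pp = CHRF(word_order=2)               # word_order=2 corresponds to chrF++
spbleu = BLEU(tokenize="flores200")        # SentencePiece-based tokenization, i.e. spBLEU

print(chrf_pp.corpus_score(hypotheses, references).score)
print(spbleu.corpus_score(hypotheses, references).score)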

Results

Results for the original model translating four South African languages (12 translation directions):

| | Our 50M model | NLLB-200-600M | % difference |
| --- | --- | --- | --- |
| Number of parameters | 49,260,544 | 615,073,792 | -92.0% |
| Memory footprint in 16-bit (GB) | 0.09 | 1.15 | -91.9% |
| chrF++ | 48.8 | 52.1 | -6.3% |
| spBLEU | 25.1 | 29.5 | -14.7% |
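
The % difference column is relative to NLLB-200-600M; for chrF++, for example, (48.8 - 52.1) / 52.1 ≈ -6.3%.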

chrF++ scores by language direction (all 12 directions among the original four languages, plus a selection of the additional directions; the remaining directions are omitted for brevity):

| Source language | Target language | Our 50M model | NLLB-200-600M | Difference |
| --- | --- | --- | --- | --- |
| isiXhosa | isiZulu | 44.3 | 45.5 | -1.3 |
| isiXhosa | Afrikaans | 43.0 | 46.1 | -3.0 |
| isiXhosa | English | 48.3 | 55.7 | -7.5 |
| isiZulu | isiXhosa | 42.9 | 42.6 | 0.3 |
| isiZulu | Afrikaans | 44.5 | 47.1 | -2.5 |
| isiZulu | English | 49.3 | 57.3 | -8.0 |
| Afrikaans | isiXhosa | 41.6 | 44.3 | -2.8 |
| Afrikaans | isiZulu | 44.9 | 47.6 | -2.7 |
| Afrikaans | English | 65.9 | 73.5 | -7.7 |
| English | isiXhosa | 46.2 | 47.3 | -1.2 |
| English | isiZulu | 49.5 | 51.6 | -2.1 |
| English | Afrikaans | 62.1 | 63.3 | -1.2 |
| English | Sepedi | 48.9 | 49.4 | -0.5 |
| English | Setswana | 45.3 | 47.3 | -2.0 |
| isiZulu | Sepedi | 44.9 | 45.2 | -0.3 |
| isiZulu | Setswana | 43.7 | 44.4 | -0.7 |
| Sepedi | English | 48.1 | 57.1 | -9.0 |
| Sepedi | isiZulu | 43.1 | 45.3 | -2.2 |
| Sepedi | Setswana | 42.2 | 43.6 | -1.4 |
| Setswana | English | 41.4 | 48.5 | -7.2 |
| Setswana | isiZulu | 39.3 | 41.3 | -2.1 |
| Setswana | Sepedi | 40.4 | 41.8 | -1.4 |

Compute Infrastructure & Environmental Impact

  • All experiments ran on a single NVIDIA A5000 (24GB) or A6000 (48GB) GPU
  • Total training time for a single model: 20 hours on A5000 ($8.80 using Jarvis Labs instances @ $0.44/hour)
  • Estimated carbon emissions for a single training run: 2.85kg CO2eq (estimated using Machine Learning Impact calculator presented in Lacoste et al. (2019))