Pocket Polyglot Mzansi 50M (4 languages)
Pocket Polyglot Mzansi is a small 50M-parameter machine translation model for South African languages. It is part of an ongoing research project that aims to develop a small (<50M parameters) machine translation model that matches or exceeds the accuracy of NLLB-200-600M on South African languages. The current version is >90% smaller than NLLB-200-600M but sacrifices only 6.3% in accuracy in terms of chrF++.
Model Details
Model Description
- Developed by: Stefan Strydom
- Model type: Small 50M parameter translation model for four South African languages built using the architecture from NLLB-200.
- Language(s) (NLP): Afrikaans (afr_Latn), English (eng_Latn), isiXhosa (xho_Latn), isiZulu (zul_Latn)
- License: CC BY-NC 4.0.
Model Sources
- Repository: Coming soon
- Paper: Deep Learning IndabaX South Africa 2025 slides
- Demo: Demo app | Demo repo
Intended use
Pocket Polyglot Mzansi is a research model. The intended use is deployment on edge devices for offline machine translation. It supports single-sentence translation between any pair of the four languages.
How to Get Started with the Model
Use the code below to get started with the model.
>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
>>> tokenizer = AutoTokenizer.from_pretrained("stefan7/pocket_polyglot_mzansi_50M_4langs")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("stefan7/pocket_polyglot_mzansi_50M_4langs")
>>> tokenizer.src_lang = "eng_Latn"
>>> text = "How was your day?"
>>> inputs = tokenizer(text, return_tensors="pt")
>>> translated_tokens = model.generate(
... **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("xho_Latn")
... )
>>> tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
'Wawunjani umhla wakho?'
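To translate in a different direction, set tokenizer.src_lang to the source language code and pass the corresponding target language code to forced_bos_token_id.

Since the intended use is offline translation on edge devices, it can also be useful to check the model's size when loaded in 16-bit precision. The snippet below is a small, optional sketch; get_memory_footprint() is a standard transformers utility, but the exact method behind the 0.09 GB figure reported under Results is an assumption.

# Optional sketch: load the model in float16 and report its in-memory size.
# Expect roughly 0.1 GB for ~49M parameters at 2 bytes each.
import torch
from transformers import AutoModelForSeq2SeqLM

model_fp16 = AutoModelForSeq2SeqLM.from_pretrained(
    "stefan7/pocket_polyglot_mzansi_50M_4langs", torch_dtype=torch.float16
)
print(f"{model_fp16.get_memory_footprint() / 1e9:.2f} GB")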
Training Details
Training Data
The model was trained on data from WMT22-African.
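As a rough, non-authoritative sketch, the WMT22-African parallel data can be pulled from the Hugging Face Hub; the dataset id and pair-config name below are assumptions about the Hub mirror, not a description of the preprocessing used for this model.

# Illustrative sketch only: loading one WMT22-African language pair from the Hub.
# The "allenai/wmt22_african" id and "eng-zul" config are assumptions; available
# configs and field names may differ.
from datasets import load_dataset

pairs = load_dataset("allenai/wmt22_african", "eng-zul", split="train")
print(pairs[0])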
Training Procedure
- Batch size of 128 sentences
- Trained on 30M sentences (230,000 update steps)
- 1cycle policy scheduler (see the sketch after this list) with:
  - two phases
  - max_lr = 1e-3
  - pct_start = 0.25
  - anneal_strategy = 'cos'
  - div_factor = 25.0
  - final_div_factor = 1e5
- Adam optimizer with mom=0.9, sqr_mom=0.98, eps=1e-6
- No dropout or weight decay (not considered/tuned yet for this work)
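The mom/sqr_mom naming above follows fastai conventions. As a rough, non-authoritative sketch, the same optimizer and schedule can be expressed with PyTorch's Adam and OneCycleLR as below; the model load is only there to make the snippet self-contained, and this is not the actual training script.

# Illustrative sketch (assumptions, not the actual training script): the listed
# hyperparameters mapped onto torch.optim.Adam and OneCycleLR.
import torch
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("stefan7/pocket_polyglot_mzansi_50M_4langs")

optimizer = torch.optim.Adam(
    model.parameters(),
    betas=(0.9, 0.98),   # mom / sqr_mom above
    eps=1e-6,
    weight_decay=0.0,    # no weight decay, as stated above
)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-3,
    total_steps=230_000,   # 230,000 update steps
    pct_start=0.25,        # warm-up covers the first 25% of steps
    anneal_strategy="cos",
    div_factor=25.0,       # initial lr = max_lr / 25
    final_div_factor=1e5,  # final lr = initial lr / 1e5
    three_phase=False,     # two phases: warm-up, then cosine anneal
)
# scheduler.step() is called once per update step, after optimizer.step()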
Evaluation
Testing Data
Tested on the Flores200 devtest split.
Metrics
Following the approach used by the NLLB-200 project, the model was evaluated using spBLEU and chrF++, metrics widely adopted by the machine translation community.
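For reference, below is a minimal sketch of how these metrics are commonly computed with sacrebleu; the exact evaluation setup behind the numbers in this card is an assumption. chrF++ is chrF with word bigrams added, and spBLEU is BLEU computed over the Flores-200 sentencepiece tokenizer (available in sacrebleu >= 2.2).

# Minimal sketch, assuming sacrebleu >= 2.2; not the exact evaluation script
# behind the scores below. Replace the placeholders with real outputs/references.
from sacrebleu.metrics import BLEU, CHRF

hypotheses = ["<model output sentence>"]
references = [["<reference translation>"]]  # one reference stream

spbleu = BLEU(tokenize="flores200")  # spBLEU
chrfpp = CHRF(word_order=2)          # chrF++

print(spbleu.corpus_score(hypotheses, references))
print(chrfpp.corpus_score(hypotheses, references))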
Results
Results for the original model translating four South African languages (12 translation directions):
| | Our 50M model | NLLB-200-600M | % difference |
|---|---|---|---|
| Number of parameters | 49,260,544 | 615,073,792 | -92.0% |
| Memory footprint in 16-bit (GB) | 0.09 | 1.15 | -91.9% |
| chrF++ | 48.8 | 52.1 | -6.3% |
| spBLEU | 25.1 | 29.5 | -14.7% |
chrF++ scores by language direction (all 12 directions for the original four languages):
| Source language | Target language | Our 50M model | NLLB-200-600M | Difference |
|---|---|---|---|---|
| isiXhosa | isiZulu | 44.3 | 45.5 | -1.3 |
| isiXhosa | Afrikaans | 43.0 | 46.1 | -3.0 |
| isiXhosa | English | 48.3 | 55.7 | -7.5 |
| isiZulu | isiXhosa | 42.9 | 42.6 | 0.3 |
| isiZulu | Afrikaans | 44.5 | 47.1 | -2.5 |
| isiZulu | English | 49.3 | 57.3 | -8.0 |
| Afrikaans | isiXhosa | 41.6 | 44.3 | -2.8 |
| Afrikaans | isiZulu | 44.9 | 47.6 | -2.7 |
| Afrikaans | English | 65.9 | 73.5 | -7.7 |
| English | isiXhosa | 46.2 | 47.3 | -1.2 |
| English | isiZulu | 49.5 | 51.6 | -2.1 |
| English | Afrikaans | 62.1 | 63.3 | -1.2 |
Compute Infrastructure & Environmental Impact
- All experiments ran on a single NVIDIA A5000 (24GB) or A6000 (48GB) GPU
- Total training time for a single model: 10 hours on an A5000 ($4.40 using Jarvis Labs instances @ $0.44/hour)
- Estimated carbon emissions for a single training run: 1.43 kg CO2eq (estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019))