---
language:
- si
- en
tags:
- transliteration
- sinhala
- mbart
- sequence-to-sequence
license: mit
datasets:
- deshanksuman/Augmented_SinhalatoRomanizedSinhala_Dataset
metrics:
- accuracy
---

# Swa bhasha mBART-50 Sinhala Transliteration Model

This model transliterates Romanized Sinhala text into Sinhala script.

## Model description

This is a fine-tuned version of deshanksuman/mbart_50_SinhalaTransliteration specialized for Sinhala transliteration. It converts Romanized Sinhala (written in Latin characters) into proper Sinhala script. Due to training resource limitations, only two-thirds of the dataset was used for training.

## Intended uses & limitations

This model is intended for transliterating Romanized Sinhala text into proper Sinhala script. It can be useful for:

- Text input conversion in applications
- Helping non-native speakers type in Sinhala
- Converting legacy text in Romanized format to proper Sinhala

## Acknowledgement

We acknowledge the support of the Supercomputing Wales project, which is part-funded by the European Regional Development Fund (ERDF) via the Welsh Government.

### How to use

```python
from transformers import MBartForConditionalGeneration, MBartTokenizer

# Load model and tokenizer
model_name = "deshanksuman/swabhashambart50SinhalaTransliteration"
tokenizer = MBartTokenizer.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

# Set language codes
tokenizer.src_lang = "en_XX"  # Using English as the source language token
tokenizer.tgt_lang = "si_LK"  # Sinhala as the target

# Prepare input
text = "mama oyata adare karanawa"
inputs = tokenizer(text, return_tensors="pt", max_length=128, padding="max_length", truncation=True)

# Generate output with beam search
outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=128,
    num_beams=5,
    early_stopping=True
)

# Decode output
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```

## Training data

The model was trained on the [deshanksuman/Augmented_SinhalatoRomanizedSinhala_Dataset](https://huggingface.co/datasets/deshanksuman/Augmented_SinhalatoRomanizedSinhala_Dataset) dataset, which contains pairs of Romanized Sinhala text and the corresponding Sinhala script.

## Training procedure

The model was trained with the following parameters:

- Learning rate: 5e-05
- Batch size: 32
- Number of epochs: 1
- Max sequence length: 128
- Optimizer: AdamW

## Model performance

The model achieves an accuracy of 72.00% on the test set.

## Examples

**Example 1:**
- Romanized: karandeniya em
- Expected: කරන්දෙණිය එම්
- Predicted: කරන්දෙණිය එම්
- Correct: True

**Example 2:**
- Romanized: yatawena minissu hoya ganna beriwa mihidan wenawa
- Expected: යටවෙන මිනිස්සු හොයා ගන්න බැරිව මිහිදන් වෙනවා
- Predicted: යටවෙන මිනිස්සු හොයා ගන්න බැරිව මිහිදන් වෙනවා
- Correct: True
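
### Reproducing the exact-match check

The examples above mark a prediction as correct when it exactly matches the expected Sinhala text. The sketch below shows how that check could be scripted over a batch of pairs, reusing the same generation settings as the usage example. The `transliterate` helper and the hard-coded pairs are illustrative assumptions, not part of the released training or evaluation code.

```python
from transformers import MBartForConditionalGeneration, MBartTokenizer

model_name = "deshanksuman/swabhashambart50SinhalaTransliteration"
tokenizer = MBartTokenizer.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)
tokenizer.src_lang = "en_XX"
tokenizer.tgt_lang = "si_LK"

def transliterate(texts):
    """Transliterate a batch of Romanized Sinhala strings to Sinhala script."""
    inputs = tokenizer(texts, return_tensors="pt", max_length=128,
                       padding=True, truncation=True)
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=128,
        num_beams=5,
        early_stopping=True,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# (romanized, expected) pairs taken from the examples above
pairs = [
    ("karandeniya em", "කරන්දෙණිය එම්"),
    ("yatawena minissu hoya ganna beriwa mihidan wenawa",
     "යටවෙන මිනිස්සු හොයා ගන්න බැරිව මිහිදන් වෙනවා"),
]

# Exact-match accuracy: a prediction counts only if it equals the reference
predictions = transliterate([romanized for romanized, _ in pairs])
correct = sum(pred.strip() == expected.strip()
              for pred, (_, expected) in zip(predictions, pairs))
print(f"Accuracy: {correct / len(pairs):.2%}")
```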