---
language:
- si
- en
tags:
- transliteration
- sinhala
- mbart
- sequence-to-sequence
license: mit
datasets:
- deshanksuman/Augmented_SinhalatoRomanizedSinhala_Dataset
metrics:
- accuracy
---

# Swa bhasha mBART-50 Sinhala Transliteration Model

This model transliterates Romanized Sinhala text into Sinhala script.

## Model description

This is a fine-tuned version of deshanksuman/mbart_50_SinhalaTransliteration specialized for Sinhala transliteration. It converts Romanized Sinhala (written in Latin characters) into proper Sinhala script. Due to training resource limitations, only two-thirds of the dataset was used for training.

## Intended uses & limitations

This model is intended for transliterating Romanized Sinhala text into proper Sinhala script. It can be useful for:

- Text input conversion in applications
- Helping non-native speakers type in Sinhala
- Converting legacy text in Romanized format to proper Sinhala

## Acknowledgement

We acknowledge the support of the Supercomputing Wales project, which is part-funded by the European Regional Development Fund (ERDF) via the Welsh Government.

### How to use

```python
from transformers import MBartForConditionalGeneration, MBartTokenizer

# Load model and tokenizer
model_name = "deshanksuman/swabhashambart50SinhalaTransliteration"
tokenizer = MBartTokenizer.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

# Set language codes
tokenizer.src_lang = "en_XX"  # Using English as the source language token
tokenizer.tgt_lang = "si_LK"  # Sinhala as the target

# Prepare input
text = "mama oyata adare karanawa"
inputs = tokenizer(text, return_tensors="pt", max_length=128, padding="max_length", truncation=True)

# Generate output with beam search
outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=128,
    num_beams=5,
    early_stopping=True
)

# Decode output
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```

## Training data

The model was trained on the [deshanksuman/Augmented_SinhalatoRomanizedSinhala_Dataset](https://huggingface.co/datasets/deshanksuman/Augmented_SinhalatoRomanizedSinhala_Dataset) dataset, which contains pairs of Romanized Sinhala text and the corresponding Sinhala script.

## Training procedure

The model was trained with the following parameters:

- Learning rate: 5e-05
- Batch size: 32
- Number of epochs: 1
- Max sequence length: 128
- Optimizer: AdamW

## Model performance

The model achieves an accuracy of 72.00% on the test set.

## Examples

**Example 1:**
- Romanized: karandeniya em
- Expected: කරන්දෙණිය එම්
- Predicted: කරන්දෙණිය එම්
- Correct: True

**Example 2:**
- Romanized: yatawena minissu hoya ganna beriwa mihidan wenawa
- Expected: යටවෙන මිනිස්සු හොයා ගන්න බැරිව මිහිදන් වෙනවා
- Predicted: යටවෙන මිනිස්සු හොයා ගන්න බැරිව මිහිදන් වෙනවා
- Correct: True
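
### Reproducing the exact-match check

The examples above mark a prediction as correct when it exactly matches the expected Sinhala text. The sketch below shows how that check could be scripted over a batch of pairs, reusing the same generation settings as the usage example. The `transliterate` helper and the hard-coded pairs are illustrative assumptions, not part of the released training or evaluation code.

```python
from transformers import MBartForConditionalGeneration, MBartTokenizer

model_name = "deshanksuman/swabhashambart50SinhalaTransliteration"
tokenizer = MBartTokenizer.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)
tokenizer.src_lang = "en_XX"
tokenizer.tgt_lang = "si_LK"

def transliterate(texts):
    """Transliterate a batch of Romanized Sinhala strings to Sinhala script."""
    inputs = tokenizer(texts, return_tensors="pt", max_length=128,
                       padding=True, truncation=True)
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=128,
        num_beams=5,
        early_stopping=True,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# (romanized, expected) pairs taken from the examples above
pairs = [
    ("karandeniya em", "කරන්දෙණිය එම්"),
    ("yatawena minissu hoya ganna beriwa mihidan wenawa",
     "යටවෙන මිනිස්සු හොයා ගන්න බැරිව මිහිදන් වෙනවා"),
]

# Exact-match accuracy: a prediction counts only if it equals the reference
predictions = transliterate([romanized for romanized, _ in pairs])
correct = sum(pred.strip() == expected.strip()
              for pred, (_, expected) in zip(predictions, pairs))
print(f"Accuracy: {correct / len(pairs):.2%}")
```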