🧠 XLM-RoBERTa-Large for Sinhala Question Answering

This model is xlm-roberta-large fine-tuned on SiQuAD and SQuAD V1.0. It performs extractive question answering (QA) in Sinhala and is evaluated with the standard QA metrics (F1, Exact Match) as well as additional overlap metrics (Jaccard, BLEU-1, BLEU-2, ROUGE-L).

📊 Evaluation Results

The model was evaluated on a combined Sinhala and English test set. All scores are percentages.

Model	Data	F1 Score	Exact Match	Jaccard Score	BLEU-1	BLEU-2	ROUGE-L
XLM-RoBERTa-Large	Sinhala + English	73.41	60.16	70.29	76.72	75.70	21.11

🔍 Metric Descriptions

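F1 Score: Harmonic mean of token-level precision and recall between the predicted and gold answer spans.
Exact Match (EM): Percentage of predictions that match the gold answer exactly.
Jaccard Score: Lexical overlap between predicted and gold answers, computed as the number of shared tokens divided by the number of distinct tokens in either answer.
BLEU-1 / BLEU-2: N-gram overlap between prediction and gold answer, using unigrams (BLEU-1) and bigrams (BLEU-2).
ROUGE-L: Overlap based on the longest common subsequence (LCS) between predictions and references.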
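
The definitions above can be illustrated with a minimal sketch for a single prediction/gold pair. It assumes lowercasing and whitespace tokenization, and an unweighted F-measure for ROUGE-L; the evaluation script behind the reported numbers may normalize text or weight recall differently, so these helper functions are illustrative only. BLEU is omitted for brevity.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace (no language-specific normalization).
    return text.lower().split()


def exact_match(pred, gold):
    # 1.0 if the token sequences are identical, else 0.0.
    return float(tokenize(pred) == tokenize(gold))


def f1_score(pred, gold):
    # Harmonic mean of token-level precision and recall.
    p, g = tokenize(pred), tokenize(gold)
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def jaccard(pred, gold):
    # Shared distinct tokens divided by all distinct tokens.
    p, g = set(tokenize(pred)), set(tokenize(gold))
    if not p and not g:
        return 1.0
    return len(p & g) / len(p | g)


def lcs_length(a, b):
    # Dynamic-programming length of the longest common subsequence.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(pred, gold):
    # Unweighted F-measure over LCS-based precision and recall (an assumption).
    p, g = tokenize(pred), tokenize(gold)
    if not p or not g:
        return float(p == g)
    lcs = lcs_length(p, g)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(g)
    return 2 * precision * recall / (precision + recall)


pred, gold = "the island of Sri Lanka", "Sri Lanka"
print(exact_match(pred, gold), f1_score(pred, gold), jaccard(pred, gold), rouge_l(pred, gold))
```

Corpus-level scores such as those in the table are typically the average of these per-example values, multiplied by 100.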

🛠️ How to Use

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Hugging Face Hub ID of this model
model_id = "janani-rane/sinQA-xlm-r-finetuned"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForQuestionAnswering.from_pretrained(model_id)
model.eval()

question = "Your Question"
context = "Your Context"
# Truncate only the context if the pair exceeds the model's maximum input length
inputs = tokenizer(question, context, return_tensors="pt", truncation="only_second")

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**inputs)

# Most likely start and end token positions of the answer span
start_idx = int(outputs.start_logits.argmax())
end_idx = int(outputs.end_logits.argmax())

# Decode the span; if end_idx falls before start_idx the answer is empty
answer_tokens = inputs["input_ids"][0][start_idx:end_idx + 1]
answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)
print(f"Question: {question}")
print(f"Answer: {answer}")
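
Taking the argmax of the start and end logits independently can yield an end position that precedes the start. One common remedy is to score candidate (start, end) pairs jointly and keep the best valid one. The sketch below shows this with plain Python lists; `best_span`, `max_answer_len`, and `top_k` are hypothetical names introduced for illustration and are not part of the model or of transformers.

```python
def best_span(start_logits, end_logits, max_answer_len=30, top_k=20):
    """Return (start, end, score) of the highest-scoring valid span, or None.

    A span is valid when start <= end and it covers at most max_answer_len tokens.
    Only the top_k start and top_k end positions are considered.
    """
    start_top = sorted(range(len(start_logits)),
                       key=lambda i: start_logits[i], reverse=True)[:top_k]
    end_top = sorted(range(len(end_logits)),
                     key=lambda i: end_logits[i], reverse=True)[:top_k]

    best = None
    for s in start_top:
        for e in end_top:
            if e < s or e - s + 1 > max_answer_len:
                continue  # skip reversed or overly long spans
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


# Toy logits where the independent argmax gives start=3, end=1 (invalid).
start = [0.1, 2.0, 0.3, 5.0]
end = [0.2, 4.0, 3.0, 0.1]
print(best_span(start, end))
```

With real model outputs, the lists could be obtained via `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`. This sketch does not exclude question or special-token positions, which a fuller implementation would mask out.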

🛠️ Sinhala Sample Question, Answer, Context triplets

Question = "නම් දහසක් ඇති දූපත ලෙස හැදින්වුයේ කුමන රටක් ද?"
Context = 'ශ්‍රී ලංකාව බ්‍රිතාන්‍ය පාලන සමයේ දී සිලෝන් ලෙස හැඳින්වු අතර වර්තමානයේ නිල වශයෙන් ශ්‍රී ලංකා ප්‍රජාතාන්ත්‍රික සමාජවාදී ජනරජය ලෙස හදුන්වයි. ශ්‍රී ලංකාව සෙරන්ඩිබ්, සිලොන්, දීප්තිමත් දිවයින, ධර්ම දිවයින, ඉන්දියන් සාගරයේ මුතු ඇටය නමින්ද හදුන්වයි. විවිධ නම් ගණනාවකින් පුරාණයේ හැඳින්වූ මෙම දූපත "නම් දහසක් ඇති දූපත" ලෙස ප්‍රචලිත ද විය. මෙය දකුණු ආසියානු දූපතකි. එය ඉන්දියානු සාගරයේ, බෙංගාල බොක්කෙහි නිරිත දෙසින් සහ අරාබි මුහුදේ ගිනිකොන දෙසින් පිහිටා ඇත; එය ඉන්දියානු උප මහද්වීපය මන්නාරම් බොක්ක සහ පෝක් සමුද්‍ර සන්ධිය මගින් වෙන් කරනු ලැබේ. ශ්‍රී ලංකාවේ පිහිටීම ප්‍රධාන මුහුදු මාර්ගවල පිහිටීම හේතුවෙන් බටහිර ආසියාව සහ අග්නිදිග ආසියාව අතර උපායමාර්ගික නාවික සම්බන්ධකයක් ලෙස එය පුරාණ කාලයේ සිටම භාවිතය ගැණුනි. ඉන්දියානු ජනරජය සහ ශ්‍රී ලංකා ජනරජය 1976 මාර්තු 23 වන දින මනාර් බොක්ක සහ බෙංගාල බොක්කෙහි සමුද්‍ර මායිම් ස්ථාපිත කරමින් ගිවිසුමක් අත්සන් කරන ලදී. ශ්‍රී ලංකාවට ආසන්න ම රටවල් ලෙස ඉන්දියාව සහ මාලදිවයින හැදින්විය හැකි ය.'
Answer = 'ශ්‍රී ලංකාව'

Model ID: janani-rane/sinQA-xlm-r-finetuned
