---
datasets:
- janani-rane/SiQuAD
- rajpurkar/squad
language:
- si
metrics:
- f1
- exact_match
- bleu
- rouge
base_model:
- FacebookAI/xlm-roberta-large
pipeline_tag: question-answering
---

# 🧠 XLM-RoBERTa-Large for Sinhala Question Answering

This model is a fine-tuned version of `xlm-roberta-large`, trained on SiQuAD and SQuAD V1.0. It performs extractive question answering (QA) in Sinhala and is evaluated with standard QA metrics (F1, Exact Match) as well as extended ones (Jaccard, BLEU, ROUGE-L).

---

## 📊 Evaluation Results

The model was evaluated on a Sinhala + English test set using multiple QA performance metrics.

| **Model**          | **Data**             | **F1 Score** | **Exact Match** | **Jaccard Score** | **BLEU-1** | **BLEU-2** | **ROUGE-L** |
|--------------------|----------------------|--------------|------------------|-------------------|------------|------------|--------------|
| XLM-RoBERTa-Large  | Sinhala + English    | **73.41**    | **60.16**        | **70.29**         | **76.72**  | **75.70**  | **21.11**    |

### 🔍 Metric Descriptions

- **F1 Score**: Harmonic mean of precision and recall on answer spans.
- **Exact Match (EM)**: Percentage of predictions that match the gold answer exactly.
- **Jaccard Score**: Lexical overlap between predicted and gold answers, computed as the size of their token intersection divided by the size of their token union.
- **BLEU-1 / BLEU-2**: N-gram precision of predictions against references, using unigrams (BLEU-1) and up to bigrams (BLEU-2).
- **ROUGE-L**: Scores the longest common subsequence (LCS) shared by predictions and references.
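
To make the first three definitions concrete, here is a minimal sketch of how F1, Exact Match, and Jaccard can be computed for a single prediction. It is not the evaluation script behind the table above: the helper names (`tokens`, `exact_match`, `f1_score`, `jaccard`) are hypothetical, and the lowercase, whitespace-only tokenization is an assumption, since this card does not state how answers were normalized.

```python
from collections import Counter


def tokens(text: str) -> list[str]:
    # Assumption: lowercase + whitespace split. The card does not specify
    # the normalization used for the reported scores.
    return text.lower().split()


def exact_match(pred: str, gold: str) -> float:
    # 1.0 when the normalized token sequences are identical, else 0.0.
    return float(tokens(pred) == tokens(gold))


def f1_score(pred: str, gold: str) -> float:
    p, g = tokens(pred), tokens(gold)
    if not p or not g:
        # Both empty counts as a match; one empty counts as a miss.
        return float(p == g)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def jaccard(pred: str, gold: str) -> float:
    p, g = set(tokens(pred)), set(tokens(gold))
    if not p and not g:
        return 1.0
    return len(p & g) / len(p | g)


if __name__ == "__main__":
    pred, gold = "sri lanka", "sri lanka island"
    print(exact_match(pred, gold), f1_score(pred, gold), jaccard(pred, gold))
```

Corpus-level scores such as those in the table are typically the average of these per-example values, scaled to a percentage.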

---

## 🛠️ How to Use

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("your-username/xlm-r-sinhala-qa")
model = AutoModelForQuestionAnswering.from_pretrained("your-username/xlm-r-sinhala-qa")

question = "Your Question"
context = "Your Context"
inputs = tokenizer(question, context, return_tensors="pt")

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**inputs)

# Most likely start and end token positions of the answer span
start_idx = outputs.start_logits.argmax()
end_idx = outputs.end_logits.argmax()

# Decode the tokens between the two positions (inclusive)
answer_tokens = inputs.input_ids[0][start_idx:end_idx + 1]
answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)
print(f"Question: {question}")
print(f"Answer: {answer}")
```
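
The snippet above takes the start and end positions independently, so the end index can land before the start index and produce an empty answer. One common remedy, sketched below under stated assumptions, is to score every valid `(start, end)` pair jointly and keep the best one. The helper `best_span` and its `max_answer_len=30` default are hypothetical and not part of this model or of `transformers`. It works on plain Python lists, which could be obtained with `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`. It does not mask question or special tokens, which a fuller implementation would also do.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with the highest summed logit,
    subject to start <= end and a span of at most max_answer_len tokens."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_logit in enumerate(start_logits):
        # Only consider end positions at or after the start position.
        last_end = min(start + max_answer_len, len(end_logits))
        for end in range(start, last_end):
            score = start_logit + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


if __name__ == "__main__":
    # Toy logits where independent argmax gives start=2, end=0 (invalid).
    start = [0.1, 0.2, 5.0, 0.3]
    end = [4.0, 0.1, 0.2, 3.0]
    print(best_span(start, end))
```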
---

## 🛠️ Sample Sinhala Question, Context, and Answer Triplet

- Question = "නම් දහසක් ඇති දූපත ලෙස හැදින්වුයේ කුමන රටක් ද?"
- Context = 'ශ්‍රී ලංකාව බ්‍රිතාන්‍ය පාලන සමයේ දී සිලෝන් ලෙස හැඳින්වු අතර වර්තමානයේ නිල වශයෙන් ශ්‍රී ලංකා ප්‍රජාතාන්ත්‍රික සමාජවාදී ජනරජය ලෙස හදුන්වයි. ශ්‍රී ලංකාව සෙරන්ඩිබ්, සිලොන්, දීප්තිමත් දිවයින, ධර්ම දිවයින, ඉන්දියන් සාගරයේ මුතු ඇටය නමින්ද හදුන්වයි. විවිධ නම් ගණනාවකින් පුරාණයේ හැඳින්වූ මෙම දූපත "නම් දහසක් ඇති දූපත" ලෙස ප්‍රචලිත ද විය. මෙය දකුණු ආසියානු දූපතකි. එය ඉන්දියානු සාගරයේ, බෙංගාල බොක්කෙහි නිරිත දෙසින් සහ අරාබි මුහුදේ ගිනිකොන දෙසින් පිහිටා ඇත; එය ඉන්දියානු උප මහද්වීපය මන්නාරම් බොක්ක සහ පෝක් සමුද්‍ර සන්ධිය මගින් වෙන් කරනු ලැබේ. ශ්‍රී ලංකාවේ පිහිටීම ප්‍රධාන මුහුදු මාර්ගවල පිහිටීම හේතුවෙන් බටහිර ආසියාව සහ අග්නිදිග ආසියාව අතර උපායමාර්ගික නාවික සම්බන්ධකයක් ලෙස එය පුරාණ කාලයේ සිටම භාවිතය ගැණුනි. ඉන්දියානු ජනරජය සහ ශ්‍රී ලංකා ජනරජය 1976 මාර්තු 23 වන දින මනාර් බොක්ක සහ බෙංගාල බොක්කෙහි සමුද්‍ර මායිම් ස්ථාපිත කරමින් ගිවිසුමක් අත්සන් කරන ලදී. ශ්‍රී ලංකාවට ආසන්න ම රටවල් ලෙස ඉන්දියාව සහ මාලදිවයින හැදින්විය හැකි ය.'
- Answer = 'ශ්‍රී ලංකාව'