---
datasets:
- polyglots/MADLAD_CulturaX_cleaned
language:
- si
metrics:
- precision
- recall
- f1
base_model:
- meta-llama/Meta-Llama-3-8B
library_name: peft
---

# Model Card for SinLlama

SinLlama is the first large language model specifically extended for Sinhala. It is based on Meta-Llama-3-8B and adapted through tokenizer vocabulary extension and continual pretraining on a 10.7M-sentence Sinhala corpus. SinLlama significantly improves coverage and performance on Sinhala NLP tasks compared to the base and instruct versions of Llama-3-8B.

**DISCLAIMER:** This is a base model that has NOT been instruction-tuned; task-specific fine-tuning is still required before use on downstream tasks.

---

## Model Details

### Model Description

SinLlama is a decoder-only large language model designed to improve NLP performance for Sinhala, a low-resource Indo-Aryan language spoken by ~20 million people in Sri Lanka. The model was developed by enhancing the Llama-3-8B tokenizer with Sinhala-specific vocabulary and performing continual pretraining on a cleaned and diverse 10.7M-sentence Sinhala corpus.  

Subsequent fine-tuning on Sinhala classification datasets (news categorization, sentiment analysis, and writing style classification) shows significant improvements over baseline Llama-3-8B models.
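
To illustrate the effect of the vocabulary extension, the same Sinhala sentence can be tokenized with the original Llama-3 tokenizer and with the extended SinLlama tokenizer. This is a minimal sketch; the example sentence is purely illustrative and the base tokenizer is access-gated on Hugging Face.

```python
from transformers import AutoTokenizer

# Compare token counts for a Sinhala sentence under the base and extended tokenizers.
base_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
ext_tok = AutoTokenizer.from_pretrained("polyglots/Extended-Sinhala-LLaMA")

sentence = "ශ්‍රී ලංකාව ඉන්දියන් සාගරයේ පිහිටි දූපතකි."  # "Sri Lanka is an island in the Indian Ocean."
print("Base tokenizer:    ", len(base_tok(sentence)["input_ids"]), "tokens")
print("Extended tokenizer:", len(ext_tok(sentence)["input_ids"]), "tokens")
```

Fewer tokens per sentence means a longer effective context window and cheaper training and inference for Sinhala text.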

- **Developed by:** H.W.K. Aravinda, Rashad Sirajudeen, Samith Karunathilake, Nisansa de Silva, Rishemjit Kaur, Surangika Ranathunga  
- **Funded by:** CSIR - Central Scientific Instruments Organization (India), Emojot (Pvt) Ltd  
- **Shared by:** Polyglots team  
- **Model type:** Decoder-only autoregressive transformer LLM  
- **Language(s) (NLP):** Sinhala (සිංහල)  
- **License:** Same as base model (Meta Llama 3 license)  
- **Finetuned from model:** meta-llama/Meta-Llama-3-8B  

### Model Sources

- **Repository:** [Hugging Face - SinLlama v01](https://huggingface.co/polyglots/SinLlama_v01)  
- **Paper:** [SinLlama: A Large Language Model for Sinhala](https://arxiv.org/abs/2508.09115v2)  
- **Dataset:** [MADLAD+CulturaX (cleaned Sinhala subset)](https://huggingface.co/datasets/polyglots/MADLAD_CulturaX_cleaned)  

---

### SinLlama Model Creation
![SinLlama Logo](asserts/SinLlama.png)

## Uses


### Downstream Use
- Instruction tuning for Sinhala dialogue systems, text classification, and other downstream tasks
- Cross-lingual applications involving Sinhala
- Educational and research applications in low-resource NLP

### Out-of-Scope Use
- Applications requiring high accuracy in non-Sinhala languages (performance may degrade due to adaptation focus on Sinhala)
- Sensitive domains (e.g., healthcare, legal) without rigorous validation
- Malicious generation (hate speech, disinformation)

---

## Bias, Risks, and Limitations

- **Bias:** Sinhala corpora may reflect sociocultural biases (e.g., political, gender, religious biases).  
- **Limitations:** Model may underperform in complex reasoning tasks or in languages other than Sinhala. Writing-style classification is observed as particularly challenging.  
- **Risk:** Misuse in spreading misinformation or biased outputs in Sinhala.  

### Recommendations
Users should carefully evaluate outputs before deployment, especially in sensitive or safety-critical applications. Fine-tuning with task/domain-specific Sinhala data is required for robustness.

---

## How to Get Started with the Model

### Install dependencies
```python
!pip install unsloth
!pip install datasets==2.21.0
!pip install pandas==2.1.4
```

### Import dependencies
```python
from unsloth import FastLanguageModel, is_bfloat16_supported
from transformers import TextStreamer, AutoTokenizer
import torch
from datasets import load_dataset, DatasetDict, concatenate_datasets, Dataset
from collections import Counter, defaultdict
import os
import sys

from trl import SFTTrainer
from transformers import TrainingArguments
import pandas as pd
```

### Configure model loading
```python
max_seq_length = 2048   # Any length works; unsloth supports RoPE scaling internally.
dtype = None            # None for auto-detection; float16 for Tesla T4/V100, bfloat16 for Ampere+.
load_in_4bit = False    # Set to True to enable 4-bit quantization and reduce memory usage.
model_name = "polyglots/SinLlama_v01"
```

### Load the model
```python
model, _ = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    resize_model_vocab=139336 # Size of new vocab
)
```

### Load our extended tokenizer
```python
tokenizer = AutoTokenizer.from_pretrained("polyglots/Extended-Sinhala-LLaMA")
model.resize_token_embeddings(len(tokenizer))
```
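
### Generate text
With the model and extended tokenizer in place, a minimal generation sketch might look like the following (the Sinhala prompt and generation settings are illustrative; a CUDA device is assumed):
```python
# Switch unsloth into its faster inference mode and stream a short completion.
FastLanguageModel.for_inference(model)

prompt = "ශ්‍රී ලංකාව"  # illustrative Sinhala prompt ("Sri Lanka")
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=64)
```
Because SinLlama is a base model, the output is a raw continuation; for task-specific behaviour, fine-tune it first (see Training Details below).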

## Training Details

### Training Data
- **Pretraining:** 10.7M Sinhala sentences (303.9M tokens) from MADLAD-400 and CulturaX, filtered for quality and cleaned.  
- **Fine-tuning:**  
  - Sentiment Analysis (~12.5K samples)  
  - Writing Style Classification (~9K samples)  
  - Sinhala News Category Classification (~3.3K samples)  

### Training Procedure
- **Tokenizer:** Extended Llama-3 tokenizer with Sinhala-specific tokens using `tiktoken`.  
- **Continual Pretraining:** Performed with the Chinese-Llama codebase, with the block size reduced from 1024 to 512 for GPU compatibility.  
- **Fine-tuning:** LoRA-based parameter-efficient fine-tuning with Alpaca-style prompts (a sketch follows below).  
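
A minimal sketch of what this LoRA setup could look like with the unsloth/TRL stack shown above; the rank, target modules, and training hyperparameters are illustrative assumptions rather than the exact values used for SinLlama, and `train_ds` is a placeholder for an Alpaca-style formatted dataset:

```python
# Attach LoRA adapters to the loaded model (values below are illustrative).
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,                                   # LoRA rank (assumed)
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
)

# Fine-tune on an Alpaca-style prompt dataset; `train_ds` and its "text"
# column are placeholders for your own formatted data.
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_ds,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        num_train_epochs = 3,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 10,
        output_dir = "outputs",
    ),
)
trainer.train()
```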

#### Training Hyperparameters
- Mixed precision (fp16/bf16) training  
- LoRA adapters for efficient fine-tuning  

---

## Evaluation

### Testing Data
- Sinhala sentiment, writing style, and news categorization datasets.  
- Splits: 80/10/10 (train/validation/test) with stratified sampling (a sketch follows below).  
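
As a rough illustration of such a split with the `datasets` library (the file path and the "label" column are hypothetical placeholders):

```python
from datasets import load_dataset

# Load a hypothetical labelled Sinhala dataset with a "label" column.
ds = load_dataset("csv", data_files="sinhala_sentiment.csv")["train"]
ds = ds.class_encode_column("label")  # required for stratified splitting

# 80% train, then split the remaining 20% evenly into validation and test.
split = ds.train_test_split(test_size=0.2, stratify_by_column="label", seed=42)
holdout = split["test"].train_test_split(test_size=0.5, stratify_by_column="label", seed=42)

train_ds, val_ds, test_ds = split["train"], holdout["train"], holdout["test"]
```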

### Metrics
- Precision, Recall, F1-score  

### Results

| Model                  | Writing Style F1 | News F1 | Sentiment F1 |
|-------------------------|-----------------|---------|--------------|
| Llama-3-8B base         | 24.50           | 19.03   | 36.29        |
| Llama-3-8B base finetuned | 49.45        | 61.14   | 59.35        |
| Llama-3-8B instruct finetuned | 42.25   | 47.81   | 68.78        |
| **SinLlama finetuned**  | **58.89**       | **86.40** | **72.47**    |

**Summary:** SinLlama outperforms both base and instruct Llama-3-8B when fine-tuned, especially in news categorization and sentiment tasks.  

---

## Environmental Impact

- **Hardware Type:** GPUs (not specified, likely A100-class)  
- **Hours used:** Not reported  
- **Cloud Provider:** CSIR & Emojot infrastructure  
- **Compute Region:** India & Sri Lanka  
- **Carbon Emitted:** Not reported  

---

## Technical Specifications

### Model Architecture and Objective
- Decoder-only transformer (Llama-3-8B backbone)  
- Autoregressive pretraining objective  
- Sinhala vocabulary-extended tokenizer  

### Compute Infrastructure
- **Hardware:** GPUs provided by CSIR-CSIO and Emojot  
- **Software:** Hugging Face `transformers`, PEFT, LoRA, `tiktoken`  

---

## Citation

**BibTeX:**
```bibtex
@article{aravinda2025sinllama,
  title={SinLlama - A Large Language Model for Sinhala},
  author={Aravinda, H W K and Sirajudeen, Rashad and Karunathilake, Samith and de Silva, Nisansa and Ranathunga, Surangika and Kaur, Rishemjit},
  journal={arXiv preprint arXiv:2508.09115},
  year={2025}
}
```

**APA:**
Aravinda, H. W. K., Sirajudeen, R., Karunathilake, S., de Silva, N., Kaur, R., & Ranathunga, S. (2025). *SinLlama -- A Large Language Model for Sinhala*. arXiv preprint arXiv:2508.09115.  

---

## Model Card Authors
- Based on information provided by the SinLlama authors.

## Model Card Contact
- [polyglots on Hugging Face](https://huggingface.co/polyglots)  

### Framework versions
- PEFT 0.13.2
- Transformers (latest at time of release)