---
datasets:
- polyglots/MADLAD_CulturaX_cleaned
language:
- si
metrics:
- precision
- recall
- f1
base_model:
- meta-llama/Meta-Llama-3-8B
library_name: peft
---
# Model Card for SinLlama
SinLlama is the first large language model specifically extended for Sinhala. It is based on Meta-Llama-3-8B and adapted through tokenizer vocabulary extension and continual pretraining on a 10M sentence Sinhala corpus. SinLlama significantly improves coverage and performance for Sinhala NLP tasks compared to base and instruct versions of Llama-3-8B.
*DISCLAIMER:*
This is a base model that has NOT been instruct-tuned; task-specific fine-tuning is still required before downstream use.
---
## Model Details
### Model Description
SinLlama is a decoder-only large language model designed to improve NLP performance for Sinhala, a low-resource Indo-Aryan language spoken by ~20 million people in Sri Lanka. The model was developed by enhancing the Llama-3-8B tokenizer with Sinhala-specific vocabulary and performing continual pretraining on a cleaned and diverse 10.7M-sentence Sinhala corpus.
Subsequent fine-tuning on Sinhala classification datasets (news categorization, sentiment analysis, and writing style classification) shows significant improvements over baseline Llama-3-8B models.
- **Developed by:** H.W.K. Aravinda, Rashad Sirajudeen, Samith Karunathilake, Nisansa de Silva, Rishemjit Kaur, Surangika Ranathunga
- **Funded by:** CSIR - Central Scientific Instruments Organization (India), Emojot (Pvt) Ltd
- **Shared by:** Polyglots team
- **Model type:** Decoder-only autoregressive transformer LLM
- **Language(s) (NLP):** Sinhala (සිංහල)
- **License:** Same as base model (Meta Llama 3 license)
- **Finetuned from model:** meta-llama/Meta-Llama-3-8B
### Model Sources
- **Repository:** [Hugging Face - SinLlama v01](https://huggingface.co/polyglots/SinLlama_v01)
- **Paper:** [SinLlama: A Large Language Model for Sinhala](https://arxiv.org/abs/2508.09115v2)
- **Dataset:** [MADLAD+CulturaX (cleaned Sinhala subset)](https://huggingface.co/datasets/polyglots/MADLAD_CulturaX_cleaned)
---
### SinLlama Model Creation
SinLlama was created in three stages: the Llama-3 tokenizer was extended with Sinhala-specific tokens, the embedding layer was resized to match the new vocabulary, and the model was continually pretrained on the cleaned 10.7M-sentence Sinhala corpus. The resulting base model can then be fine-tuned (e.g., with LoRA) for downstream Sinhala tasks.
## Uses
### Downstream Use
- Instruction tuning for Sinhala dialogue systems, text classification, and other downstream tasks
- Cross-lingual applications involving Sinhala
- Educational and research applications in low-resource NLP
### Out-of-Scope Use
- Applications requiring high accuracy in non-Sinhala languages (performance may degrade due to adaptation focus on Sinhala)
- Sensitive domains (e.g., healthcare, legal) without rigorous validation
- Malicious generation (hate speech, disinformation)
---
## Bias, Risks, and Limitations
- **Bias:** Sinhala corpora may reflect sociocultural biases (e.g., political, gender, religious biases).
- **Limitations:** Model may underperform in complex reasoning tasks or in languages other than Sinhala. Writing-style classification is observed as particularly challenging.
- **Risk:** Misuse in spreading misinformation or biased outputs in Sinhala.
### Recommendations
Users should carefully evaluate outputs before deployment, especially in sensitive or safety-critical applications. Fine-tuning with task/domain-specific Sinhala data is required for robustness.
---
## How to Get Started with the Model
### Install dependencies
```bash
pip install unsloth
pip install datasets==2.21.0
pip install pandas==2.1.4
```
### Import dependencies
```python
import os
import sys
from collections import Counter, defaultdict

import torch
import pandas as pd
from unsloth import FastLanguageModel, is_bfloat16_supported
from transformers import AutoTokenizer, TextStreamer, TrainingArguments
from datasets import load_dataset, DatasetDict, concatenate_datasets, Dataset
from trl import SFTTrainer
```
### Set the loading configuration
```python
max_seq_length = 2048   # unsloth supports RoPE scaling internally, so longer contexts also work
dtype = None            # None = auto-detect; float16 for T4/V100, bfloat16 for Ampere+
load_in_4bit = False    # set True to enable 4-bit quantization and reduce memory usage
model_name = "polyglots/SinLlama_v01"
```
### Load the model
```python
model, _ = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    resize_model_vocab=139336,  # size of the extended Sinhala vocabulary
)
```
### Load our extended tokenizer
```python
tokenizer = AutoTokenizer.from_pretrained("polyglots/Extended-Sinhala-LLaMA")
model.resize_token_embeddings(len(tokenizer))
```
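### Generate text (example)
With the model and extended tokenizer loaded, the minimal sketch below generates a Sinhala continuation. The prompt text and decoding parameters are illustrative rather than taken from the original card; since this is a base model, a plain continuation prompt works better than an instruction.
```python
# Switch unsloth into its faster inference mode
FastLanguageModel.for_inference(model)

# Illustrative continuation prompt ("Sri Lanka is ..."); replace with your own text
prompt = "ශ්‍රී ලංකාව යනු"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Stream tokens to stdout as they are generated
streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=128)
```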
## Training Details
### Training Data
- **Pretraining:** 10.7M Sinhala sentences (303.9M tokens) from MADLAD-400 and CulturaX, filtered for quality and cleaned.
- **Fine-tuning:**
- Sentiment Analysis (~12.5K samples)
- Writing Style Classification (~9K samples)
- Sinhala News Category Classification (~3.3K samples)
### Training Procedure
- **Tokenizer:** Extended Llama-3 tokenizer with Sinhala-specific tokens using `tiktoken`.
- **Continual Pretraining:** Performed with the Chinese-Llama codebase; the block size was reduced from 1024 to 512 for GPU memory compatibility.
- **Fine-tuning:** LoRA-based parameter-efficient fine-tuning with Alpaca-style prompts (see the prompt-formatting sketch below).
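The exact prompt template used for fine-tuning is not reproduced in this card. The sketch below shows one common Alpaca-style format for the Sinhala classification tasks; the instruction wording, the column names (`sentence`, `label`), and the `format_example` helper are illustrative assumptions.
```python
# Hypothetical Alpaca-style template; the published card does not include the exact wording
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input.\n"
    "Write a response that completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

def format_example(example, eos_token):
    """Render one classification example into a single training string."""
    text = ALPACA_TEMPLATE.format(
        instruction="Classify the sentiment of the following Sinhala sentence.",
        input=example["sentence"],  # assumed column name
        output=example["label"],    # assumed column name
    )
    return {"text": text + eos_token}  # append EOS so generation stops after the label
```
Mapping this over a dataset (e.g., `dataset.map(lambda ex: format_example(ex, tokenizer.eos_token))`) yields the `text` field consumed by the trainer sketch further below.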
#### Training Hyperparameters
- Mixed precision (fp16/bf16) training
- LoRA adapters for efficient fine-tuning
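The full hyperparameter set is not reported in this card, so the following fine-tuning sketch with unsloth and `trl` should be read as an assumption-laden example: the LoRA rank, target modules, batch size, learning rate, and the `train_dataset` variable are all illustrative.
```python
# Attach LoRA adapters (rank, alpha, and target modules are illustrative choices)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    use_gradient_checkpointing=True,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,   # a Dataset with a "text" column (see the prompt sketch above)
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        output_dir="sinllama-finetuned",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),  # mixed precision, as noted above
        bf16=is_bfloat16_supported(),
        logging_steps=10,
    ),
)
trainer.train()
```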
---
## Evaluation
### Testing Data
- Sinhala sentiment, writing style, and news categorization datasets.
- Splits: 80/10/10 with stratified sampling.
### Metrics
- Precision, Recall, F1-score
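As a small usage sketch (scikit-learn is assumed, and whether the paper uses macro or weighted averaging is not stated here), the scores can be computed as follows:
```python
from sklearn.metrics import precision_recall_fscore_support

# Illustrative gold and predicted labels for a handful of test items
y_true = ["positive", "negative", "neutral", "positive"]
y_pred = ["positive", "neutral", "neutral", "positive"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"P={precision:.2f}  R={recall:.2f}  F1={f1:.2f}")
```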
### Results
| Model | Writing Style F1 | News F1 | Sentiment F1 |
|-------------------------|-----------------|---------|--------------|
| Llama-3-8B base | 24.50 | 19.03 | 36.29 |
| Llama-3-8B base finetuned | 49.45 | 61.14 | 59.35 |
| Llama-3-8B instruct finetuned | 42.25 | 47.81 | 68.78 |
| **SinLlama finetuned** | **58.89** | **86.40** | **72.47** |
**Summary:** SinLlama outperforms both base and instruct Llama-3-8B when fine-tuned, especially in news categorization and sentiment tasks.
---
## Environmental Impact
- **Hardware Type:** GPUs (not specified, likely A100-class)
- **Hours used:** Not reported
- **Cloud Provider:** CSIR & Emojot infrastructure
- **Compute Region:** India & Sri Lanka
- **Carbon Emitted:** Not reported
---
## Technical Specifications
### Model Architecture and Objective
- Decoder-only transformer (Llama-3-8B backbone)
- Autoregressive pretraining objective
- Sinhala vocabulary-extended tokenizer
### Compute Infrastructure
- **Hardware:** GPUs provided by CSIR-CSIO and Emojot
- **Software:** Hugging Face `transformers`, PEFT, LoRA, `tiktoken`
---
## Citation
**BibTeX:**
```bibtex
@article{aravinda2025sinllama,
  title   = {SinLlama -- A Large Language Model for Sinhala},
  author  = {Aravinda, H. W. K. and Sirajudeen, Rashad and Karunathilake, Samith and de Silva, Nisansa and Ranathunga, Surangika and Kaur, Rishemjit},
  journal = {arXiv preprint arXiv:2508.09115},
  year    = {2025}
}
```
**APA:**
Aravinda, H. W. K., Sirajudeen, R., Karunathilake, S., de Silva, N., Kaur, R., & Ranathunga, S. (2025). *SinLlama -- A Large Language Model for Sinhala*. arXiv preprint arXiv:2508.09115.
---
## Model Card Authors
- Based on information from the SinLlama authors
## Model Card Contact
- [polyglots on Hugging Face](https://huggingface.co/polyglots)
### Framework versions
- PEFT 0.13.2
- Transformers (latest at time of release)