---
datasets:
- polyglots/MADLAD_CulturaX_cleaned
language:
- si
metrics:
- precision
- recall
- f1
base_model:
- meta-llama/Meta-Llama-3-8B
library_name: peft
---

# Model Card for SinLlama

SinLlama is the first large language model specifically extended for Sinhala. It is based on Meta-Llama-3-8B and adapted through tokenizer vocabulary extension and continual pretraining on a 10.7M-sentence Sinhala corpus. SinLlama significantly improves coverage and performance on Sinhala NLP tasks compared to the base and instruct versions of Llama-3-8B.

---

## Model Details

### Model Description

SinLlama is a decoder-only large language model designed to improve NLP performance for Sinhala, a low-resource Indo-Aryan language spoken by roughly 20 million people in Sri Lanka. The model was developed by extending the Llama-3-8B tokenizer with Sinhala-specific vocabulary and performing continual pretraining on a cleaned, diverse 10.7M-sentence Sinhala corpus. Subsequent fine-tuning on Sinhala classification datasets (news categorization, sentiment analysis, and writing style classification) shows significant improvements over baseline Llama-3-8B models.

- **Developed by:** H.W.K. Aravinda, Rashad Sirajudeen, Samith Karunathilake, Nisansa de Silva, Rishemjit Kaur, Surangika Ranathunga
- **Funded by:** CSIR - Central Scientific Instruments Organization (India), Emojot (Pvt) Ltd
- **Shared by:** Polyglots team
- **Model type:** Decoder-only autoregressive transformer LLM
- **Language(s) (NLP):** Sinhala (සිංහල)
- **License:** Same as the base model (Meta Llama 3 license)
- **Finetuned from model:** meta-llama/Meta-Llama-3-8B

### Model Sources

- **Repository:** [Hugging Face - SinLlama v01](https://huggingface.co/polyglots/SinLlama_v01)
- **Paper:** [SinLlama: A Large Language Model for Sinhala](https://arxiv.org/abs/2508.09115v2)
- **Dataset:** [MADLAD+CulturaX (cleaned Sinhala subset)](https://huggingface.co/datasets/polyglots/MADLAD_CulturaX_cleaned)

---

### SinLlama Model Creation

![SinLlama Logo](https://huggingface.co/datasets/polyglots/assets/resolve/main/sinllama_logo.png)

## Uses

### Direct Use

- Sinhala text generation
- Sinhala text classification
- Sentiment analysis, news categorization, and writing style classification

### Downstream Use

- Instruction tuning for Sinhala dialogue systems
- Cross-lingual applications involving Sinhala
- Educational and research applications in low-resource NLP

### Out-of-Scope Use

- Applications requiring high accuracy in non-Sinhala languages (performance may degrade because the adaptation focused on Sinhala)
- Sensitive domains (e.g., healthcare, legal) without rigorous validation
- Malicious generation (hate speech, disinformation)

---

## Bias, Risks, and Limitations

- **Bias:** Sinhala corpora may reflect sociocultural biases (e.g., political, gender, or religious biases).
- **Limitations:** The model may underperform on complex reasoning tasks or in languages other than Sinhala. Writing-style classification remains particularly challenging.
- **Risk:** Misuse for spreading misinformation or biased outputs in Sinhala.

### Recommendations

Users should carefully evaluate outputs before deployment, especially in sensitive or safety-critical applications. Fine-tuning with task- or domain-specific Sinhala data is recommended for robustness.
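Before deploying, one quick, low-cost sanity check of the Sinhala coverage gain claimed above is to compare how many tokens the base Llama-3 tokenizer and the extended SinLlama tokenizer need for the same Sinhala text. The snippet below is a minimal, illustrative sketch: it assumes both tokenizers can be downloaded from the Hub, and the example sentence is arbitrary.

```python
from transformers import AutoTokenizer

# Both repositories are referenced elsewhere in this card; exact token counts
# depend on tokenizer versions, so treat the numbers as indicative only.
base_tokenizer = AutoTokenizer.from_pretrained("unsloth/llama-3-8b")
extended_tokenizer = AutoTokenizer.from_pretrained("polyglots/Extended-Sinhala-LLaMA")

text = "සිංහල භාෂාව"  # arbitrary Sinhala text ("the Sinhala language")

print("Base Llama-3 tokens:    ", len(base_tokenizer.tokenize(text)))
print("Extended Sinhala tokens:", len(extended_tokenizer.tokenize(text)))
```

Fewer tokens per sentence generally means a longer effective context window and cheaper training and inference for Sinhala text, which is the motivation for the vocabulary extension.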
---

## How to Get Started with the Model

### Install dependencies

```python
!pip install unsloth  # or install from source: @ git+https://github.com/unslothai/unsloth.git
!pip install datasets==2.21.0
!pip install pandas==2.1.4
```

### Import dependencies

```python
from unsloth import FastLanguageModel, is_bfloat16_supported
from transformers import AutoTokenizer, TextStreamer, TrainingArguments
import torch
from datasets import load_dataset, DatasetDict, concatenate_datasets, Dataset
from collections import Counter, defaultdict
import os
import sys
from trl import SFTTrainer
import pandas as pd
```

### Configure model loading

```python
model_config = {"model_name": "unsloth/llama-3-8b", "load_in_4bit": False}  # base-model config, kept for reference
max_seq_length = 2048   # Choose any! RoPE scaling is supported automatically.
dtype = None            # None for auto detection; float16 for Tesla T4/V100, bfloat16 for Ampere+
load_in_4bit = False    # Set True to use 4-bit quantization and reduce memory usage.
model_name = "polyglots/SinLlama_v01"  # SinLlama checkpoint (change the model name here)
```

### Load the model

```python
model, _ = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    resize_model_vocab = 139336,
    # token = "hf_...",  # use one if loading gated models like meta-llama/Llama-2-7b-hf
)
```

### Load our extended tokenizer

```python
tokenizer = AutoTokenizer.from_pretrained("polyglots/Extended-Sinhala-LLaMA")
model.resize_token_embeddings(len(tokenizer))  # match the extended Sinhala vocabulary
```

## Training Details

### Training Data

- **Pretraining:** 10.7M Sinhala sentences (303.9M tokens) from MADLAD-400 and CulturaX, filtered for quality and cleaned.
- **Fine-tuning:**
  - Sentiment Analysis (~12.5K samples)
  - Writing Style Classification (~9K samples)
  - Sinhala News Category Classification (~3.3K samples)

### Training Procedure

- **Tokenizer:** Extended the Llama-3 tokenizer with Sinhala-specific tokens using `tiktoken`.
- **Continual Pretraining:** Based on the Chinese-LLaMA codebase, with the block size reduced from 1024 to 512 for GPU compatibility.
- **Fine-tuning:** LoRA-based parameter-efficient fine-tuning with Alpaca-style prompts (an illustrative sketch follows the Evaluation section below).

#### Training Hyperparameters

- Mixed-precision (fp16/bf16) training
- LoRA adapters for efficient fine-tuning

---

## Evaluation

### Testing Data

- Sinhala sentiment, writing style, and news categorization datasets
- Splits: 80/10/10 with stratified sampling

### Metrics

- Precision, Recall, F1-score

### Results

| Model                           | Writing Style F1 | News F1   | Sentiment F1 |
|---------------------------------|------------------|-----------|--------------|
| Llama-3-8B base                 | 24.50            | 19.03     | 36.29        |
| Llama-3-8B base, fine-tuned     | 49.45            | 61.14     | 59.35        |
| Llama-3-8B instruct, fine-tuned | 42.25            | 47.81     | 68.78        |
| **SinLlama, fine-tuned**        | **58.89**        | **86.40** | **72.47**    |

**Summary:** When fine-tuned, SinLlama outperforms both the base and instruct variants of Llama-3-8B, especially on news categorization and sentiment analysis.
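The fine-tuned rows above were obtained with the LoRA-based, Alpaca-style fine-tuning described in the Training Procedure. The sketch below is a rough, hedged illustration of how such task fine-tuning can be set up with the Unsloth/TRL stack installed in the quickstart; the prompt wording, example data, output path, and hyperparameters are illustrative assumptions rather than the authors' exact configuration, and `model`, `tokenizer`, and `max_seq_length` are the objects defined above. Depending on your installed `trl` version, some `SFTTrainer` arguments may differ.

```python
from datasets import Dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel, is_bfloat16_supported

# Attach LoRA adapters to the SinLlama model loaded in the quickstart
# (rank, alpha, and target modules are illustrative choices).
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    lora_dropout = 0.0,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)

# Wrap labelled Sinhala examples in an Alpaca-style prompt (hypothetical toy data).
alpaca_prompt = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)
examples = [
    {"instruction": "Classify the sentiment of the following Sinhala sentence.",
     "input": "සිංහල භාෂාව",  # placeholder Sinhala text
     "output": "POSITIVE"},
]
train_dataset = Dataset.from_list(
    [{"text": alpaca_prompt.format(**ex) + tokenizer.eos_token} for ex in examples]
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        num_train_epochs = 1,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        output_dir = "sinllama-task-finetune",  # illustrative output path
        report_to = "none",
    ),
)
trainer.train()
```

For real tasks, replace the toy `examples` list with one of the Sinhala classification datasets listed under Training Data and evaluate with the precision/recall/F1 protocol described above.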
---

## Environmental Impact

- **Hardware Type:** GPUs (not specified, likely A100-class)
- **Hours used:** Not reported
- **Cloud Provider:** CSIR & Emojot infrastructure
- **Compute Region:** India & Sri Lanka
- **Carbon Emitted:** Not reported

---

## Technical Specifications

### Model Architecture and Objective

- Decoder-only transformer (Llama-3-8B backbone)
- Autoregressive pretraining objective
- Sinhala vocabulary-extended tokenizer

### Compute Infrastructure

- **Hardware:** GPUs provided by CSIR-CSIO and Emojot
- **Software:** Hugging Face `transformers`, PEFT, LoRA, `tiktoken`

---

## Citation

**BibTeX:**

```bibtex
@article{aravinda2025sinllama,
  title={SinLlama -- A Large Language Model for Sinhala},
  author={Aravinda, H W K and Sirajudeen, Rashad and Karunathilake, Samith and de Silva, Nisansa and Ranathunga, Surangika and Kaur, Rishemjit},
  journal={arXiv preprint arXiv:2508.09115},
  year={2025}
}
```

**APA:**

Aravinda, H. W. K., Sirajudeen, R., Karunathilake, S., de Silva, N., Kaur, R., & Ranathunga, S. (2025). *SinLlama -- A Large Language Model for Sinhala*. arXiv preprint arXiv:2508.09115.

---

## Model Card Authors

- Based on information from the SinLlama authors

## Model Card Contact

- [polyglots on Hugging Face](https://huggingface.co/polyglots)

### Framework versions

- PEFT 0.13.2
- Transformers (latest at time of release)