# XLM-RoBERTa Fine-Tuned on Tigrinya (MLM)

This model is a fine-tuned version of `xlm-roberta-base` for the Tigrinya language (ትግርኛ), trained with the Masked Language Modeling (MLM) objective. It uses a custom BPE tokenizer for Tigrinya, with the new token embeddings initialized via a FastText-informed transfer from the pretrained XLM-R embeddings.
## 🧠 Details

- Base model: `xlm-roberta-base`
- Language: Tigrinya
- Tokenizer: Custom BPE tokenizer (non-morpheme-aware)
- Adaptation: Embedding initialization using weighted averages of pretrained XLM-R embeddings, guided by Tigrinya FastText word vectors (see the sketch after this list)
- Training dataset: Tigrinya side of the NLLB (No Language Left Behind) parallel corpus
- Objective: Masked Language Modeling (MLM)
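The weighted-average initialization can be sketched as follows. The card does not document the exact weighting scheme, so the helper `ft_sim` and the softmax over the top-`k` FastText similarities are assumptions; the sketch only illustrates the general recipe of building each new Tigrinya embedding from the most similar pretrained XLM-R embeddings.

```python
import numpy as np

def init_tigrinya_embedding(new_token, src_vocab, src_embeddings, ft_sim, k=5):
    """Minimal sketch: one new token embedding as a similarity-weighted
    average of pretrained XLM-R embeddings.

    ft_sim(a, b) is a hypothetical helper returning the cosine similarity
    of two words' FastText vectors; src_embeddings is the pretrained
    XLM-R input-embedding matrix of shape (|V|, d).
    """
    scores = np.array([ft_sim(new_token, w) for w in src_vocab])
    top = np.argsort(scores)[-k:]         # indices of the k most similar source tokens
    weights = np.exp(scores[top])
    weights /= weights.sum()              # softmax over the top-k similarities (assumed)
    return weights @ src_embeddings[top]  # (k,) @ (k, d) -> (d,)
```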
## 🧪 Usage

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Hailay/xlmr-tigriyna-mlm")
model = AutoModelForMaskedLM.from_pretrained("Hailay/xlmr-tigriyna-mlm")

# Any Tigrinya sentence works; include the mask token to get an MLM prediction.
text = f"ትግራይ {tokenizer.mask_token}።"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)  # outputs.logits holds per-token vocabulary scores
```
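For quick inspection of the masked prediction, the standard `fill-mask` pipeline from `transformers` can be used as well; the input sentence here is only a placeholder:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="Hailay/xlmr-tigriyna-mlm")

# Prints the top vocabulary candidates for the masked position.
for pred in fill(f"ትግራይ {fill.tokenizer.mask_token}።"):
    print(pred["token_str"], round(pred["score"], 3))
```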
## 📚 Intended Use

- Pretraining for Tigrinya NLP tasks
- Fine-tuning on classification, NER, QA, and other downstream tasks in Tigrinya (see the sketch after this list)
- Research in low-resource Semitic and morphologically rich languages
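As an example of the fine-tuning bullet above, the checkpoint can be loaded as a backbone for sequence classification. This is the generic `transformers` recipe, not a configuration tested for this model, and `num_labels=3` is a placeholder:

```python
from transformers import AutoModelForSequenceClassification

# The MLM head is discarded; a freshly initialized classification head is added.
clf = AutoModelForSequenceClassification.from_pretrained(
    "Hailay/xlmr-tigriyna-mlm",
    num_labels=3,  # placeholder: set to your task's label count
)
```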
## 📖 Citation

```bibtex
@misc{hailay2025tigrinya,
  title={Tigrinya MLM with XLM-R and FastText-Informed Embedding Initialization},
  author={Hailay Kidu},
  year={2025},
  url={https://huggingface.co/Hailay/xlmr-tigriyna-mlm}
}
```
## 🏷️ License

Apache License 2.0