# XLM-RoBERTa Fine-Tuned on Amharic (MLM)
This model is a fine-tuned version of `xlm-roberta-base` for the Amharic language (አማርኛ), trained with the masked language modeling (MLM) objective. It adapts the base model to Amharic using a custom BPE tokenizer and an embedding initialization informed by Amharic FastText word vectors.
## 🔧 Details
- Base model: `xlm-roberta-base`
- Language: Amharic
- Tokenizer: Custom BPE tokenizer (not morpheme-aware)
- Adaptation: embeddings for the new vocabulary are initialized as weighted averages of pretrained XLM-R embeddings, with weights guided by Amharic FastText word vectors (see the sketch after this list)
- Training dataset: Amharic portion of the NLLB (No Language Left Behind) parallel corpus
- Objective: Masked Language Modeling (MLM)
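
The card does not spell out the initialization recipe, so the following is a minimal sketch of one common FastText-guided approach: each token in the new vocabulary gets a weighted average of pretrained XLM-R embeddings, with weights derived from FastText similarity between token strings. The `cc.am.300.bin` vectors, the neighbour count `k`, and the softmax weighting are illustrative assumptions, not the confirmed training setup.

```python
# Sketch: initialize new-vocabulary embeddings as FastText-weighted averages
# of pretrained XLM-R embeddings. Paths and hyperparameters are assumptions.
import fasttext
import numpy as np
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

base = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
old_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
new_tokenizer = AutoTokenizer.from_pretrained("Hailay/xlmr-amharic-mlm")
ft = fasttext.load_model("cc.am.300.bin")  # Amharic FastText vectors (assumed)

old_emb = base.get_input_embeddings().weight.detach().numpy()

def vocab_vectors(tokenizer):
    # FastText vector for every token string, in token-id order, unit-normalized.
    vocab = sorted(tokenizer.get_vocab().items(), key=lambda kv: kv[1])
    vecs = np.stack([ft.get_word_vector(tok) for tok, _ in vocab])
    return vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-8)

old_vecs = vocab_vectors(old_tokenizer)  # (|V_old|, 300)
new_vecs = vocab_vectors(new_tokenizer)  # (|V_new|, 300)

k = 10  # number of nearest old tokens to average over (assumption)
new_emb = np.zeros((len(new_vecs), old_emb.shape[1]), dtype=np.float32)
for i, v in enumerate(new_vecs):
    sims = old_vecs @ v                  # cosine similarity to every old token
    top = np.argpartition(-sims, k)[:k]  # k most similar old tokens
    w = np.exp(sims[top])
    w /= w.sum()                         # softmax weights over the neighbours
    new_emb[i] = w @ old_emb[top]        # weighted average of their embeddings

base.resize_token_embeddings(len(new_tokenizer))
base.get_input_embeddings().weight.data.copy_(torch.from_numpy(new_emb))
```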
## 🧪 Usage
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Hailay/xlmr-amharic-mlm")
model = AutoModelForMaskedLM.from_pretrained("Hailay/xlmr-amharic-mlm")

# "Ethiopia showed high growth."
text = "ኢትዮጵያ ከፍተኛ እድገት አሳየች።"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)  # outputs.logits: per-token scores over the vocabulary
```
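
To exercise the MLM head directly, the `fill-mask` pipeline scores candidate tokens for a masked position. The example sentence and `top_k` value below are illustrative:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="Hailay/xlmr-amharic-mlm")

# "Ethiopia showed high <mask>." — mask the noun and ask for candidates.
masked = f"ኢትዮጵያ ከፍተኛ {fill.tokenizer.mask_token} አሳየች።"
for pred in fill(masked, top_k=5):
    print(pred["token_str"], round(pred["score"], 3))
```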
## 📌 Intended Use
- Pretrained base for Amharic NLP tasks
- Fine-tuning on classification, NER, QA, and other downstream tasks in Amharic (see the sketch below)
- Research on low-resource Semitic languages
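
For downstream fine-tuning, the checkpoint can be loaded with a freshly initialized task head. A minimal sketch for sequence classification follows; `num_labels=3` is a placeholder, not a property of the model:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Hailay/xlmr-amharic-mlm")
# num_labels is a placeholder; the classification head is newly initialized
# and must be trained on labeled Amharic data.
model = AutoModelForSequenceClassification.from_pretrained(
    "Hailay/xlmr-amharic-mlm", num_labels=3
)
```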
## 📖 Citation
```bibtex
@misc{hailay2025amharic,
  title  = {Amharic MLM with XLM-R and FastText-Informed Embedding Initialization},
  author = {Hailay Kidu},
  year   = {2025},
  url    = {https://huggingface.co/Hailay/xlmr-amharic-mlm}
}
```
## 🏷️ License
Apache License 2.0