Kurdish RoBERTa (Sorani)
Kurdish RoBERTa is a pre-trained language model for Central Kurdish (Sorani) that provides high-quality contextual word embeddings. The model serves as a feature extractor and can also be fine-tuned for downstream tasks.
Model Details
Architecture
- Base Model: XLM-RoBERTa-large
- Hidden Size: 1024
- Layers: 24
- Attention Heads: 16
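As a quick sanity check, these architecture values can be read directly from the checkpoint's configuration. This is a minimal sketch using the repository name from the usage example below:

```python
from transformers import AutoConfig

# Load the configuration shipped with the checkpoint
config = AutoConfig.from_pretrained("abdulhade/RoBERTa-large-SizeCorpus_1B")

print(config.hidden_size)          # 1024
print(config.num_hidden_layers)    # 24
print(config.num_attention_heads)  # 16
```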
Training Data
- 1B token Kurdish corpus (KurdishTextCorpus)
- Covers various domains including news, literature, and web text
Pretraining
- Objective: Masked language modeling (15% dynamic masking)
- Batch Size: 128
- Sequence Length: 512 tokens
- Training Hardware: 4× NVIDIA A100 GPUs
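The training script itself is not reproduced here; the sketch below only illustrates how 15% dynamic masking is typically set up with Hugging Face Transformers' DataCollatorForLanguageModeling, and is not necessarily the authors' exact pipeline:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("abdulhade/RoBERTa-large-SizeCorpus_1B")

# Masks 15% of tokens anew each time a batch is built ("dynamic" masking)
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

example = tokenizer("لیژنەی فتوا دەلێن سینگڵ زەکات وسەرفیترەی پێ دەشێت.",
                    truncation=True, max_length=512)
batch = collator([example])
print(batch["input_ids"].shape, batch["labels"].shape)
```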
Uses
Direct Use
- Feature extraction for Kurdish text
- Contextual word embeddings
Downstream Use
- Fine-tuning for:
- Named Entity Recognition (NER)
- Text classification
- Question answering
- Other sequence labeling tasks (a fine-tuning sketch is shown under How to Use below)

The corpus data tables and the detailed methodology can be found in the full research paper; they are summarized below for quick reference.
Corpus Data Tables Summary
Table 1: AsoSoft Kurdish Text Corpus
| Source | Number of Tokens |
|---|---|
| Crawled From Websites | 95M |
| Text Books | 45M |
| Magazines | 48M |
| Sum | 188M |
Table 2: Muhammad Azizi and AramRafeq Text Corpus
| Source | Number of Tokens |
|---|---|
| Wikipedia | 13.5M |
| Wishe Website | 11M |
| Speemedia Website | 6.5M |
| Kurdiu Website | 19M |
| Dengiamerika Website | 2M |
| Chawg Website | 8M |
| Sum | 60M |
Table 3: The Kurdish Text Corpus Used to Train BERT
| Corpus Name | Number of Tokens |
|---|---|
| Oscar 2019 corpus | 48.5M |
| AsoSoft corpus | 188M |
| Muhammad Azizi and AramRafeq corpus | 60M |
| Sum | 296.5M |
How to Use
Feature Extraction
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("abdulhade/RoBERTa-large-SizeCorpus_1B")
model = AutoModel.from_pretrained("abdulhade/RoBERTa-large-SizeCorpus_1B")

text = "لیژنەی فتوا دەلێن سینگڵ زەکات وسەرفیترەی پێ دەشێت."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Contextual embeddings: one vector per token, shape (batch, seq_len, 1024)
embeddings = outputs.last_hidden_state
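Fine-Tuning (Sketch)
For the downstream tasks listed above, the checkpoint can be loaded with a task-specific head. The snippet below is a minimal sketch for token classification (e.g. NER); the label set is a hypothetical placeholder and the training loop itself is omitted:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Hypothetical label set for illustration only; replace with the labels of your NER data
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = AutoTokenizer.from_pretrained("abdulhade/RoBERTa-large-SizeCorpus_1B")
model = AutoModelForTokenClassification.from_pretrained(
    "abdulhade/RoBERTa-large-SizeCorpus_1B",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# The token-classification head is freshly initialised; train it (e.g. with the
# Trainer API or a custom loop) on a labelled Kurdish dataset before inference.
```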
Cite
If you use our text corpus, please cite us:
@article{abdullah2024ner,
title={NER-RoBERTa: Fine-Tuning RoBERTa for Named Entity Recognition (NER) within Low-Resource Languages},
author={Abdullah, Aso A and Abdulla, Sana H and Toufiq, Darya M and others},
journal={arXiv preprint arXiv:2412.15252},
year={2024}
}