Kurdish RoBERTa (Sorani)

Kurdish RoBERTa is a pre-trained language model for Central Kurdish (Sorani) that provides high-quality contextual word embeddings. The model can serve as a feature extractor or as a base for fine-tuning on downstream Kurdish NLP tasks.

Model Details

Architecture

  • Base Model: XLM-RoBERTa-large
  • Hidden Size: 1024
  • Layers: 24
  • Attention Heads: 16
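
These dimensions can be read directly from the checkpoint's configuration. A minimal check, assuming the Hub identifier used later in this card:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("abdulhade/RoBERTa-large-SizeCorpus_1B")
# For the XLM-RoBERTa-large architecture described above, this should print: 1024 24 16
print(config.hidden_size, config.num_hidden_layers, config.num_attention_heads)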

Training Data

  • 1B token Kurdish corpus (KurdishTextCorpus)
  • Covers various domains including news, literature, and web text

Pretraining

  • Objective: Masked language modeling (15% dynamic masking)
  • Batch Size: 128
  • Sequence Length: 512 tokens
  • Training Hardware: 4× NVIDIA A100 GPUs
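
The 15% dynamic masking objective corresponds to the standard masked-language-modeling collator in transformers. The snippet below is a minimal sketch of that masking step, not the authors' full training script; a fresh 15% of tokens is selected every time a batch is built.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("abdulhade/RoBERTa-large-SizeCorpus_1B")

# Dynamic masking: a new random 15% of tokens is masked each time the collator is called
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoding = tokenizer("لیژنەی فتوا دەلێن سینگڵ زەکات وسەرفیترەی پێ دەشێت.",
                     truncation=True, max_length=512)
batch = collator([encoding])
print(batch["input_ids"][0])   # some tokens replaced by the <mask> token id
print(batch["labels"][0])      # -100 everywhere except the masked positions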

Uses

Direct Use

  • Feature extraction for Kurdish text
  • Contextual word embeddings

Downstream Use

  • Fine-tuning for:
    • Named Entity Recognition (NER)
    • Text classification
    • Question answering
    • Other sequence labeling tasks (see the fine-tuning sketch under How to Use)

The corpus data tables and the detailed methodology can be found in the full research paper; they are summarized below for quick reference.

Corpus Data Tables Summary

Table 1: AsoSoft Kurdish Text Corpus

Source                   Number of Tokens
Crawled From Websites    95M
Text Books               45M
Magazines                48M
Sum                      188M

Table 2: Muhammad Azizi and AramRafeq Text Corpus

Source                   Number of Tokens
Wikipedia                13.5M
Wishe Website            11M
Speemedia Website        6.5M
Kurdiu Website           19M
Dengiamerika Website     2M
Chawg Website            8M
Sum                      60M

Table 3: The Kurdish Text Corpus Used to Train BERT

Corpus Name                            Number of Tokens
Oscar 2019 corpus                      48.5M
AsoSoft corpus                         188M
Muhammad Azizi and AramRafeq corpus    60M
Sum                                    296.5M
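
As a quick arithmetic check, the sub-corpus sizes reported in Table 3 add up to the stated total:

# Token counts (in millions) taken from Table 3 above
subcorpora = {"Oscar 2019": 48.5, "AsoSoft": 188.0, "Muhammad Azizi and AramRafeq": 60.0}
print(sum(subcorpora.values()))   # 296.5 (million tokens), matching the Sum row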

How to Use

Feature Extraction

import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the pretrained encoder from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("abdulhade/RoBERTa-large-SizeCorpus_1B")
model = AutoModel.from_pretrained("abdulhade/RoBERTa-large-SizeCorpus_1B")

# Example Kurdish (Sorani) sentence
text = "لیژنەی فتوا دەلێن سینگڵ زەکات وسەرفیترەی پێ دەشێت."
inputs = tokenizer(text, return_tensors="pt")

# Run the encoder without computing gradients
with torch.no_grad():
    outputs = model(**inputs)

# Contextual embeddings: one 1024-dimensional vector per token
embeddings = outputs.last_hidden_state
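
The last_hidden_state tensor has shape (batch_size, sequence_length, 1024), i.e. one contextual vector per token. A common way to reduce it to a single sentence vector is masked mean pooling; the lines below are a minimal sketch continuing from the example above.

# Masked mean pooling over tokens to get one 1024-dimensional sentence vector
mask = inputs["attention_mask"].unsqueeze(-1)                    # (1, seq_len, 1)
sentence_embedding = (embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)                                  # torch.Size([1, 1024])

Fine-Tuning for Token Classification

For downstream tasks such as NER, the same checkpoint can be loaded with a token-classification head through the standard transformers API. The sketch below only shows how the head is attached and queried; the label set is a hypothetical placeholder, the head is randomly initialized, and actual fine-tuning on labeled data (e.g. with Trainer) is still required.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical BIO label set, for illustration only
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = AutoTokenizer.from_pretrained("abdulhade/RoBERTa-large-SizeCorpus_1B")
model = AutoModelForTokenClassification.from_pretrained(
    "abdulhade/RoBERTa-large-SizeCorpus_1B",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

inputs = tokenizer("لیژنەی فتوا دەلێن سینگڵ زەکات وسەرفیترەی پێ دەشێت.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                  # (1, seq_len, num_labels)

# Untrained head: predictions are meaningless until the model is fine-tuned
predicted_ids = logits.argmax(dim=-1)[0].tolist()
print([model.config.id2label[i] for i in predicted_ids])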

Cite

If you use our text corpus, please cite us:

@article{abdullah2024ner,
  title={NER-RoBERTa: Fine-Tuning RoBERTa for Named Entity Recognition (NER) within Low-Resource Languages},
  author={Abdullah, Aso A and Abdulla, Sana H and Toufiq, Darya M and others},
  journal={arXiv preprint arXiv:2412.15252},
  year={2024}
}