Persian Masked Language Model (MLM)
This model is a Masked Language Model (MLM) trained on a 72.9-billion-token corpus of Persian text, making it one of the largest and most comprehensive models pre-trained exclusively for the Persian language. The model is designed to enhance language understanding tasks and provide high-quality contextual embeddings for various NLP applications in Persian.
- Our Paper: Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization (https://arxiv.org/abs/2501.04858)
Model Details
Model Description
- Model Type: Masked Language Model (MLM)
- Base Model: XLM-RoBERTa Large
- Objective: Predicting randomly masked tokens within sequences
- Training Corpus Size: 72.9 billion tokens
- Maximum Sequence Length: 512 tokens
- Special Feature: No Next Sentence Prediction (NSP) task
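To make the masked-token objective above concrete, the following minimal sketch shows how random masking is typically applied with the 🤗 Transformers data collator. The model id "your_model_id" is a placeholder for the published checkpoint, and the 15% masking rate is the library default, not a value reported for this model.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling
# "your_model_id" is a placeholder for the published checkpoint name
tokenizer = AutoTokenizer.from_pretrained("your_model_id")
# mlm_probability=0.15 is the library default, not a reported training detail
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
encoded = tokenizer("زبان فارسی یکی از زبان‌های هندواروپایی است.", truncation=True, max_length=512)
batch = collator([encoded])
# batch["input_ids"] now contains randomly masked tokens;
# batch["labels"] holds the original ids at masked positions and -100 elsewhere
print(batch["input_ids"])
print(batch["labels"])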
Training Details
Training Configuration
- Hardware: 8 NVIDIA A800 GPUs
- Duration: One week
- Optimization Framework: DeepSpeed (Stage 0)
- Training Parameters:
  - Learning Rate: 5e-5
  - Maximum Sequence Length: 512 tokens
  - Precision: FP16 (Mixed Precision)
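As a rough illustration of the setup above, a ZeRO Stage 0 DeepSpeed configuration with FP16 could be expressed as the Python dict below and passed to the 🤗 Trainer via TrainingArguments(deepspeed=...). All fields beyond the stage and the FP16 flag are assumptions, not reported settings.
# Illustrative DeepSpeed config (ZeRO Stage 0, FP16); values other than the
# stage and fp16 flag are assumptions, not details reported for this model
ds_config = {
    "zero_optimization": {"stage": 0},
    "fp16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}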
Corpus
The model was pre-trained on a large-scale corpus of Persian text collected from diverse sources, ensuring broad language coverage and contextual diversity:
- Web-crawled data
- Academic articles and books
- Persian Wikipedia
- Religious texts
- Social media platforms
The data underwent extensive preprocessing, including deduplication and noise removal, to ensure high-quality training data.
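The exact preprocessing pipeline is not published; the sketch below only illustrates the kind of exact-duplicate removal and light normalization such a step might involve, and is not the authors' implementation.
import hashlib
import re

def normalize(text: str) -> str:
    # Illustrative cleanup: unify zero-width non-joiners and collapse whitespace before hashing
    text = text.replace("\u200c", " ")
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(docs):
    # Keep the first occurrence of each normalized document (exact-duplicate removal only)
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.md5(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique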
Usage
The model can be used for various downstream NLP tasks in Persian, including:
- Text classification
- Named entity recognition
- Question answering
- Semantic search
- Contextual embedding generation (see the embedding sketch under Example Usage below)
Example Usage
This model can be loaded and used with the 🤗 Transformers library:
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("your_model_id")
model = AutoModelForMaskedLM.from_pretrained("your_model_id")
# Example text (use the tokenizer's own mask token; XLM-RoBERTa-based models use "<mask>" rather than "[MASK]")
text = f"این یک {tokenizer.mask_token} جدید است."
inputs = tokenizer(text, return_tensors="pt")
# Predict the masked token
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
# Decode the highest-scoring prediction at the masked position
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
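For semantic search and contextual embedding generation (listed under Usage above), sentence vectors can be pooled from the encoder's hidden states. The following is a minimal sketch assuming mean pooling over the last hidden layer; the pooling strategy and the example sentences are illustrative choices, not recommendations from the paper.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("your_model_id")
encoder = AutoModel.from_pretrained("your_model_id")  # loads the encoder without the MLM head

def embed(sentences):
    # Mean-pool the last hidden states, ignoring padding positions
    batch = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between two related Persian sentences
vectors = embed(["کتاب جدیدی خریدم.", "یک کتاب تازه تهیه کردم."])
print(torch.nn.functional.cosine_similarity(vectors[0], vectors[1], dim=0).item())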
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 30 (per device)
- eval_batch_size: 8 (per device)
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 480
- total_eval_batch_size: 64
- optimizer: AdamW (torch implementation) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 1.0
- mixed_precision_training: Native AMP
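For reference, these hyperparameters map onto 🤗 TrainingArguments roughly as follows. This is an illustrative sketch: the output directory is hypothetical, and any field not listed above is an assumption.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./persian-mlm",   # hypothetical output path
    learning_rate=5e-5,
    per_device_train_batch_size=30,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=1.0,
    lr_scheduler_type="linear",
    seed=42,
    fp16=True,                    # native AMP mixed precision
)
# The DeepSpeed ZeRO Stage 0 config sketched under "Training Configuration"
# would be passed via the `deepspeed` argument of TrainingArguments.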
Framework versions
- Transformers 4.47.0.dev0
- Pytorch 2.4.1+cu121
- Datasets 3.0.2
- Tokenizers 0.20.1
Citation
If you find this model helpful, please cite the following paper.
BibTeX:
@misc{hosseinbeigi2025advancingretrievalaugmentedgenerationpersian,
title={Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization},
author={Sara Bourbour Hosseinbeigi and Sina Asghari and Mohammad Ali Seif Kashani and Mohammad Hossein Shalchian and Mohammad Amin Abbasi},
year={2025},
eprint={2501.04858},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.04858},
}