Persian Masked Language Model (MLM)

This model is a Masked Language Model (MLM) trained on a 72.9-billion-token corpus of Persian text, making it one of the largest and most comprehensive models pre-trained exclusively for the Persian language. The model is designed to enhance language understanding tasks and provide high-quality contextual embeddings for various NLP applications in Persian.

  • Our Paper: Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization (https://arxiv.org/abs/2501.04858)

Model Details

Model Description

  • Model Type: Masked Language Model (MLM)
  • Base Model: XLM-RoBERTa Large
  • Objective: Predicting randomly masked tokens within sequences
  • Training Corpus Size: 72.9 billion tokens
  • Maximum Sequence Length: 512 tokens
  • Number of Parameters: ~560M (F32 Safetensors checkpoint)
  • Special Feature: No Next Sentence Prediction (NSP) task
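
As a quick sanity check, the base architecture and tokenizer properties listed above can be read directly from the checkpoint. The snippet below is a minimal sketch; "your_model_id" is a placeholder for the actual repository ID.

from transformers import AutoConfig, AutoTokenizer

# "your_model_id" is a placeholder; replace it with the actual repository ID
config = AutoConfig.from_pretrained("your_model_id")
tokenizer = AutoTokenizer.from_pretrained("your_model_id")

print(config.model_type)               # expected: "xlm-roberta" (XLM-RoBERTa Large base)
print(config.max_position_embeddings)  # size of the position-embedding table
print(tokenizer.mask_token)            # XLM-RoBERTa tokenizers use "<mask>"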

Training Details

Training Configuration

  • Hardware: 8 NVIDIA A800 GPUs
  • Duration: One week
  • Optimization Framework: DeepSpeed (Stage 0)
  • Training Parameters:
    • Learning Rate: 5e-5
    • Maximum Sequence Length: 512 tokens
    • Precision: FP16 (Mixed Precision)
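
The DeepSpeed configuration file itself is not published with this card; the dictionary below is a minimal sketch of a ZeRO Stage 0 + FP16 setup consistent with the settings above (the per-GPU batch size and accumulation steps are taken from the hyperparameters listed further down).

# Illustrative DeepSpeed configuration, not the authors' exact file
ds_config = {
    "zero_optimization": {"stage": 0},     # Stage 0: no parameter/optimizer partitioning
    "fp16": {"enabled": True},             # FP16 mixed precision
    "train_micro_batch_size_per_gpu": 30,  # per-device train batch size
    "gradient_accumulation_steps": 2,      # effective batch size: 30 x 8 GPUs x 2 = 480
}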

Corpus

The model was pre-trained on a large-scale corpus of Persian text collected from diverse sources, ensuring broad language coverage and contextual diversity:

  • Web-crawled data
  • Academic articles and books
  • Persian Wikipedia
  • Religious texts
  • Social media platforms

The data underwent extensive preprocessing, including deduplication and noise removal, to ensure high-quality training data.
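
The exact preprocessing pipeline is not released with this card; as a rough illustration, document-level deduplication can be as simple as hashing whitespace-normalized text and keeping the first occurrence (illustrative sketch only):

import hashlib

def deduplicate(documents):
    # Keep only the first occurrence of each exact (whitespace-normalized) document
    seen = set()
    unique = []
    for doc in documents:
        key = hashlib.md5(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique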

Usage

The model can be used for various downstream NLP tasks in Persian, including:

  • Text classification
  • Named entity recognition
  • Question answering
  • Semantic search
  • Contextual embedding generation

Example Usage

This model can be loaded and used with the 🤗 Transformers library:

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("your_model_id")
model = AutoModelForMaskedLM.from_pretrained("your_model_id")

# Example text (Persian for "This is a new [MASK]."). XLM-RoBERTa-based models
# use "<mask>" as the mask token, so insert tokenizer.mask_token instead of
# hard-coding "[MASK]".
text = f"این یک {tokenizer.mask_token} جدید است."
inputs = tokenizer(text, return_tensors="pt")

# Predict the masked token
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits

# Decode the top prediction at the masked position
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
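
Beyond masked-token prediction, the encoder can also produce contextual embeddings for semantic search. The following is an illustrative sketch, not a pooling recipe prescribed by the authors: it mean-pools the last hidden states over non-padding tokens and compares two sample sentences with cosine similarity.

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("your_model_id")
encoder = AutoModel.from_pretrained("your_model_id")

# Two sample Persian sentences ("This is a sample sentence." / "Another sentence for comparison.")
sentences = ["این یک جمله نمونه است.", "جمله دیگری برای مقایسه."]
batch = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state  # (batch, seq_len, hidden_size)

# Mean-pool over non-padding tokens to get one vector per sentence
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity as a simple semantic-search score
score = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(score.item())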

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 30
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 8
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 480
  • total_eval_batch_size: 64
  • optimizer: AdamW (torch implementation) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • num_epochs: 1.0
  • mixed_precision_training: Native AMP
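
These values map onto 🤗 TrainingArguments roughly as shown below. This is a hedged sketch: the output directory is a placeholder, fp16 assumes a CUDA device, and a DeepSpeed Stage 0 config (see the sketch in the training-configuration section) would additionally be passed via the deepspeed argument.

from transformers import TrainingArguments

# Illustrative mapping of the hyperparameters listed above; "persian-mlm" is a placeholder
training_args = TrainingArguments(
    output_dir="persian-mlm",
    learning_rate=5e-5,
    per_device_train_batch_size=30,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,   # 30 per device x 8 GPUs x 2 steps = 480 total train batch
    num_train_epochs=1.0,
    lr_scheduler_type="linear",
    seed=42,
    fp16=True,                       # Native AMP mixed precision (requires a CUDA device)
    optim="adamw_torch",
)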

Framework versions

  • Transformers 4.47.0.dev0
  • PyTorch 2.4.1+cu121
  • Datasets 3.0.2
  • Tokenizers 0.20.1

Citation

If you find this model helpful, please cite the following paper.

BibTeX:

@misc{hosseinbeigi2025advancingretrievalaugmentedgenerationpersian,
      title={Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization}, 
      author={Sara Bourbour Hosseinbeigi and Sina Asghari and Mohammad Ali Seif Kashani and Mohammad Hossein Shalchian and Mohammad Amin Abbasi},
      year={2025},
      eprint={2501.04858},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.04858}, 
}