---
language:
  - fa
library_name: transformers
widget:
  - text: ز سوزناکی گفتار من [MASK] بگریست
    example_title: Poetry 1
  - text: نظر از تو برنگیرم همه [MASK] تا بمیرم که تو در دلم نشستی و سر مقام داری
    example_title: Poetry 2
  - text: هر ساعتم اندرون بجوشد [MASK] را وآگاهی نیست مردم بیرون را
    example_title: Poetry 3
  - text: غلام همت آن رند عافیت سوزم که در گدا صفتی [MASK] داند
    example_title: Poetry 4
  - text: این [MASK] اولشه.
    example_title: Informal 1
  - text: دیگه خسته شدم! [MASK] اینم شد کار؟!
    example_title: Informal 2
  - text: فکر نکنم به موقع برسیم. بهتره [MASK] این یکی بشیم.
    example_title: Informal 3
  - text: تا صبح بیدار موندم و داشتم برای [MASK] آماده می شدم.
    example_title: Informal 4
  - text: زندگی بدون [MASK] خسته‌کننده است.
    example_title: Formal 1
  - text: >-
      در حکم اولیه این شرکت مجاز به فعالیت شد ولی پس از بررسی مجدد، مجوز این
      شرکت [MASK] شد.
    example_title: Formal 2
---

# FaBERT: Pre-training BERT on Persian Blogs

## Model Details

FaBERT is a Persian BERT-base model pre-trained on the diverse HmBlogs corpus, which covers both casual and formal Persian text. Evaluated across a range of Natural Language Understanding (NLU) tasks, FaBERT consistently shows notable improvements while keeping a compact model size. The model is available on Hugging Face, so integrating it into your projects is straightforward.

## Features

- Pre-trained on the diverse HmBlogs corpus, consisting of more than 50 GB of text from Persian blogs
- Strong performance across various downstream NLP tasks
- BERT architecture with 124 million parameters

## Useful Links

- Paper: [FaBERT: Pre-training BERT on Persian Blogs](https://arxiv.org/abs/2402.06617)

## Usage

### Loading the Model with MLM head

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")  # make sure to use the default fast tokenizer
model = AutoModelForMaskedLM.from_pretrained("sbunlp/fabert")
```
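
As a quick check that the MLM head works, the checkpoint can also be queried through the `fill-mask` pipeline. This is a minimal sketch; the example sentence is taken from the widget examples above, and the exact predictions and scores will vary.

```python
from transformers import pipeline

# Build a fill-mask pipeline from the same checkpoint.
fill_mask = pipeline("fill-mask", model="sbunlp/fabert")

# One of the widget examples from the model card metadata.
predictions = fill_mask("زندگی بدون [MASK] خسته‌کننده است.")
for p in predictions:
    print(p["token_str"], p["score"])
```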

### Downstream Tasks

Like the original English BERT, FaBERT can be [fine-tuned](https://huggingface.co/docs/transformers/en/training) on many downstream tasks.

Examples on Persian datasets are available in our GitHub repository.

Make sure to use the default fast tokenizer.
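
Below is a minimal fine-tuning sketch using the Hugging Face `Trainer`. It assumes a generic text-classification dataset loaded from local CSV files with `text` and integer `label` columns; the file names, `num_labels`, and hyperparameters are illustrative placeholders, not the setup used in the paper.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")
model = AutoModelForSequenceClassification.from_pretrained("sbunlp/fabert", num_labels=3)

# Placeholder data: any dataset with "text" and integer "label" columns works here.
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

dataset = dataset.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="fabert-finetuned",
    learning_rate=2e-5,               # illustrative values, not the paper's hyperparameters
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,              # enables dynamic padding via DataCollatorWithPadding
)
trainer.train()
```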

## Training Details

FaBERT was pre-trained with the masked language modeling (MLM) objective using whole word masking (WWM), and the resulting perplexity on the validation set was 7.76.

| Hyperparameter | Value |
|---|---|
| Batch Size | 32 |
| Optimizer | Adam |
| Learning Rate | 6e-5 |
| Weight Decay | 0.01 |
| Total Steps | 18 Million |
| Warmup Steps | 1.8 Million |
| Precision Format | TF32 |
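
For illustration, the snippet below shows how whole-word-masked batches could be produced with `DataCollatorForWholeWordMask` from `transformers`. It only demonstrates the objective; it is not the authors' actual pre-training pipeline, and the example sentences are arbitrary.

```python
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")
# Relies on the WordPiece "##" sub-token convention used by BERT-style tokenizers.
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

# Tokenize a couple of raw sentences (no padding; the collator pads the batch).
texts = ["این کتاب را خواندم.", "هوا امروز خیلی سرد است."]
encoded = [tokenizer(t) for t in texts]

batch = collator(encoded)
print(batch["input_ids"])  # whole words corrupted together (mostly replaced by [MASK])
print(batch["labels"])     # original ids at masked positions, -100 elsewhere
```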

## Evaluation

Here are some key performance results for the FaBERT model:

### Sentiment Analysis

| Task | FaBERT | ParsBERT | XLM-R |
|---|---|---|---|
| MirasOpinion | 87.51 | 86.73 | 84.92 |
| MirasIrony | 74.82 | 71.08 | 75.51 |
| DeepSentiPers | 79.85 | 74.94 | 79.00 |

### Named Entity Recognition

| Task | FaBERT | ParsBERT | XLM-R |
|---|---|---|---|
| PEYMA | 91.39 | 91.24 | 90.91 |
| ParsTwiner | 82.22 | 81.13 | 79.50 |
| MultiCoNER v2 | 57.92 | 58.09 | 51.47 |

### Question Answering

| Task | FaBERT | ParsBERT | XLM-R |
|---|---|---|---|
| ParsiNLU | 55.87 | 44.89 | 42.55 |
| PQuAD | 87.34 | 86.89 | 87.60 |
| PCoQA | 53.51 | 50.96 | 51.12 |

### Natural Language Inference & QQP

| Task | FaBERT | ParsBERT | XLM-R |
|---|---|---|---|
| FarsTail | 84.45 | 82.52 | 83.50 |
| SBU-NLI | 66.65 | 58.41 | 58.85 |
| ParsiNLU QQP | 82.62 | 77.60 | 79.74 |

### Number of Parameters

|  | FaBERT | ParsBERT | XLM-R |
|---|---|---|---|
| Parameter Count (M) | 124 | 162 | 278 |
| Vocabulary Size (K) | 50 | 100 | 250 |

For a more detailed performance analysis, refer to the [paper](https://arxiv.org/abs/2402.06617).

## How to Cite

If you use FaBERT in your research or projects, please cite it using the following BibTeX:

```bibtex
@article{masumi2024fabert,
  title={FaBERT: Pre-training BERT on Persian Blogs},
  author={Masumi, Mostafa and Majd, Seyed Soroush and Shamsfard, Mehrnoush and Beigy, Hamid},
  journal={arXiv preprint arXiv:2402.06617},
  year={2024}
}
```