Model Card for AmoooEBI/Bert-fa-qa-finetuned-on-PersianQA

This model is a version of ParsBERT, fine-tuned for extractive question answering on the Persian language using the PersianQA dataset.

Model Details

Model Description

This is a ParsBERT model fine-tuned on the SajjadAyoubi/persian_qa dataset. It is designed for extractive question answering, meaning it extracts the answer to a question directly from a given context. The fine-tuning process has significantly improved its ability to understand and respond to questions in Persian compared to the base model.

  • Developed by: Amir Mohammad Ebrahiminasab
  • Shared by: Amir Mohammad Ebrahiminasab
  • Model type: bert
  • Language(s) (NLP): fa (Persian)
  • License: MIT
  • Finetuned from model: pedramyazdipoor/parsbert_question_answering_PQuAD

Uses

Direct Use

The model can be used for extractive question answering in Persian. You can provide a context and a question, and the model will extract the answer span from the context.

from transformers import pipeline

# Load the fine-tuned model and its tokenizer into a QA pipeline.
qa_pipeline = pipeline(
    "question-answering",
    model="AmoooEBI/Bert-fa-qa-finetuned-on-PersianQA",
    tokenizer="AmoooEBI/Bert-fa-qa-finetuned-on-PersianQA"
)

# Context: "Farhad Majidi Ghadikolaei, known as Farhad Majidi, is a football
# player from Iran. He also has a record of playing for the Esteghlal club."
context = "فرهاد مجیدی قادیکلایی مشهور به فرهاد مجیدی بازیکن فوتبال اهل ایران است. او همچنین سابقه بازی در باشگاه استقلال را در کارنامه دارد."
# Question: "For which team has Farhad Majidi played?"
question = "فرهاد مجیدی در چه تیمی سابقه بازی دارد؟"

result = qa_pipeline(question=question, context=context)
# {'score': 0.99..., 'start': 101, 'end': 108, 'answer': 'استقلال'}

print(f"Answer: '{result['answer']}'")  # Answer: 'استقلال' ("Esteghlal")
Bias, Risks, and Limitations

The model's performance is directly shaped by the PersianQA dataset it was fine-tuned on, so it may not perform as well on contexts from other domains or with different linguistic styles. Evaluation also shows a clear drop in Exact Match for answers longer than the dataset's average length (38.56% vs. 53.01% for shorter answers), indicating a bias toward extracting shorter text spans.

Recommendations

Users should be aware of the model's limitations, especially its reduced accuracy on longer answer spans. For critical applications, the model's outputs should be verified.
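
Because the dataset also contains unanswerable questions, one lightweight verification step is to filter the pipeline's predictions by confidence score. Below is a minimal sketch: handle_impossible_answer is a standard option of the transformers QA pipeline, and the 0.5 threshold is an illustrative assumption that should be tuned on held-out data.

from transformers import pipeline

qa_pipeline = pipeline(
    "question-answering",
    model="AmoooEBI/Bert-fa-qa-finetuned-on-PersianQA",
)

result = qa_pipeline(
    question="فرهاد مجیدی در چه تیمی سابقه بازی دارد؟",  # "For which team has Farhad Majidi played?"
    context="فرهاد مجیدی قادیکلایی مشهور به فرهاد مجیدی بازیکن فوتبال اهل ایران است. او همچنین سابقه بازی در باشگاه استقلال را در کارنامه دارد.",
    handle_impossible_answer=True,  # allow an empty answer for unanswerable questions
)

CONFIDENCE_THRESHOLD = 0.5  # assumption: tune this cutoff on a held-out set
if result["answer"] and result["score"] >= CONFIDENCE_THRESHOLD:
    print(f"Answer: {result['answer']} (score {result['score']:.2f})")
else:
    print("No confident answer; route to manual review.")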

How to Get Started with the Model

Use the code below to get started with the model using PyTorch.

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("AmoooEBI/Bert-fa-qa-finetuned-on-PersianQA")
model = AutoModelForQuestionAnswering.from_pretrained("AmoooEBI/Bert-fa-qa-finetuned-on-PersianQA")

context = "پایتخت اسپانیا شهر مادرید است."  # "The capital of Spain is the city of Madrid."
question = "پایتخت اسپانیا کجاست؟"  # "Where is the capital of Spain?"

# Tokenize the question/context pair and run a forward pass without gradients.
inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The most likely start and end token indices of the answer span.
answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()

# Decode the predicted token span back into text.
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
answer = tokenizer.decode(predict_answer_tokens)

print(f"Question: {question}")
print(f"Answer: {answer}")
# Answer: مادرید ("Madrid")

Training Details

Training Data

The model was fine-tuned on the SajjadAyoubi/persian_qa dataset, which contains question-context-answer triplets in Persian.
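
For reference, the dataset can be loaded directly from the Hugging Face Hub. The field names below follow the SQuAD-style schema the dataset uses; unanswerable questions carry an empty answers list.

from datasets import load_dataset

# Load the PersianQA train and validation splits.
dataset = load_dataset("SajjadAyoubi/persian_qa")
print(dataset)

# Each record pairs a question with a context and a SQuAD-style answers dict.
example = dataset["train"][0]
print(example["question"])
print(example["answers"])  # {'text': [...], 'answer_start': [...]}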

Training Procedure

Preprocessing

The training data was preprocessed by tokenizing question and context pairs. Long contexts were handled by creating multiple features for a single example using a sliding window approach (doc_stride). The start and end token positions for the answer were identified in the tokenized input.
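
A condensed sketch of this preprocessing is shown below. The max_length of 384 and doc_stride of 128 are common defaults and are assumptions here, not confirmed training values; the answer-position labeling step is omitted for brevity.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AmoooEBI/Bert-fa-qa-finetuned-on-PersianQA")

def preprocess(examples, max_length=384, doc_stride=128):
    # Tokenize question/context pairs; long contexts overflow into extra features.
    return tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",        # truncate only the context, never the question
        max_length=max_length,
        stride=doc_stride,               # overlap between consecutive windows
        return_overflowing_tokens=True,  # one example can yield several features
        return_offsets_mapping=True,     # needed to locate answer start/end tokens
        padding="max_length",
    )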

Training Hyperparameters

The model was trained with the following hyperparameters:

| Argument | Value |
| --- | --- |
| Learning Rate | $2 \times 10^{-5}$ |
| Training Epochs | 10 |
| Train Batch Size | 8 |
| Evaluation Batch Size | 8 |
| Weight Decay | 0.01 |
| Scheduler Type | Cosine |
| Warmup Ratio | 0.1 |
| Best Model Metric | F1-Score |
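
Expressed as transformers TrainingArguments, these settings would look roughly as follows. The output_dir is a placeholder, metric_for_best_model="f1" assumes the metric key used during evaluation, and older transformers releases spell eval_strategy as evaluation_strategy.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-fa-qa-finetuned-on-persianqa",  # placeholder path
    learning_rate=2e-5,
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    eval_strategy="epoch",       # evaluate once per epoch
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",  # select the best checkpoint by F1-Score
)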

Speeds, Sizes, Times

  • The full fine-tuning process took approximately 1 hour and 22 minutes (≈1.37 hours) on a single NVIDIA T4 GPU.

Evaluation

The model was evaluated on the validation split of the SajjadAyoubi/persian_qa dataset.

Testing Data, Factors & Metrics

Testing Data

The evaluation was performed on the validation set of the SajjadAyoubi/persian_qa dataset.

Factors

The model's performance was analyzed based on two factors:

  • Answer Presence: Performance was measured separately for questions that have an answer in the context versus those that do not.
  • Answer Length: Performance was analyzed for answers shorter than the validation set average (22.78 characters) and those longer than the average.

Metrics

  • F1-Score: The primary metric, measuring the harmonic mean of precision and recall over token overlap between the prediction and the ground truth.
  • Exact Match (EM): The percentage of predictions that match the ground truth answer exactly (see the computation sketch below).
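
Both metrics can be computed with the evaluate library. The sketch below uses the squad_v2 metric because the dataset contains unanswerable questions; whether the original evaluation used this exact metric implementation is an assumption.

import evaluate

squad_v2 = evaluate.load("squad_v2")

# One toy prediction/reference pair in the format squad_v2 expects.
predictions = [{"id": "0", "prediction_text": "استقلال", "no_answer_probability": 0.0}]
references = [{"id": "0", "answers": {"text": ["استقلال"], "answer_start": [101]}}]

results = squad_v2.compute(predictions=predictions, references=references)
print(results["exact"], results["f1"])  # 100.0 100.0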

Results

Summary

Overall Performance on the Validation Set

| Model Status | Exact Match | F1-Score |
| --- | --- | --- |
| Fine-Tuned Model (10 Epochs) | 55.59% | 71.97% |

Performance on Data Subsets

| Case Type | Exact Match | F1-Score |
| --- | --- | --- |
| Has Answer | 44.70% | 68.22% |
| No Answer | 78.14% | 78.14% |

| Answer Length | Exact Match | F1-Score |
| --- | --- | --- |
| Longer than Avg. | 38.56% | 69.80% |
| Shorter than Avg. | 53.01% | 68.88% |

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator.

  • Hardware Type: T4 GPU
  • Hours used: 1.37
  • Cloud Provider: Google Colab
  • Carbon Emitted: [Not Calculated]

Technical Specifications

Model Architecture and Objective

The model follows the BERT-base architecture (~162M parameters, stored as float32 in safetensors format) with a linear layer on top of the hidden-state outputs that predicts the start and end positions of the answer span. The training objective was to minimize the cross-entropy loss over the start and end token positions of the answer.
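
This objective can be written out explicitly. The sketch below mirrors the loss that AutoModelForQuestionAnswering computes internally when start_positions and end_positions are supplied:

import torch.nn.functional as F

def qa_loss(start_logits, end_logits, start_positions, end_positions):
    # start_logits, end_logits: (batch, seq_len); positions: (batch,) token indices
    start_loss = F.cross_entropy(start_logits, start_positions)
    end_loss = F.cross_entropy(end_logits, end_positions)
    # Average the two cross-entropy terms, as in BertForQuestionAnswering.
    return (start_loss + end_loss) / 2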

Compute Infrastructure

Hardware

The model was trained on a single NVIDIA T4 GPU.

Software

  • transformers
  • torch
  • datasets
  • evaluate

Model Card Authors

Amir Mohammad Ebrahiminasab

Model Card Contact

[email protected]
