|
--- |
|
language: fa |
|
library_name: sentence-transformers |
|
pipeline_tag: sentence-similarity |
|
tags: |
|
- cross-encoder |
|
- reranker |
|
- persian |
|
- farsi |
|
- xlm-roberta |
|
- scientific-qa |
|
datasets:
|
- PersianSciQA |
|
--- |
|
|
|
# Cross-Encoder for Persian Scientific Relevance Ranking |
|
|
|
This is a cross-encoder model based on `xlm-roberta-large` that has been fine-tuned for relevance ranking of Persian scientific texts. It takes a question and a document (an abstract) as input and outputs a score from 0 to 1 indicating their relevance. |
|
|
|
This model was trained as a reranker for a Persian scientific Question Answering system. |
|
|
|
## Model Details |
|
|
|
- **Base Model:** `xlm-roberta-large` |
|
- **Task:** Reranking / Sentence Similarity |
|
- **Fine-tuning Framework:** `sentence-transformers` |
|
- **Language:** Persian (fa) |
|
|
|
## Intended Use |
|
|
|
The primary use of this model is as a **reranker** in a search or question-answering pipeline. Given a user's query and a list of candidate documents retrieved by a faster first-stage model (such as BM25 or a bi-encoder), this cross-encoder re-scores the top candidates to produce a more accurate final ranking (see the two-stage sketch at the end of the How to Use section below).
|
|
|
### How to Use |
|
|
|
To use the model, first install the `sentence-transformers` library: |
|
```bash |
|
pip install -U sentence-transformers
```

Then load the model and score query-document pairs:

```python
from sentence_transformers import CrossEncoder
|
|
|
# Load the model from the Hugging Face Hub |
|
model_name = 'YOUR_HF_USERNAME/reranker-xlm-roberta-large' #<-- IMPORTANT: Replace with your model name! |
|
model = CrossEncoder(model_name) |
|
|
|
# Prepare your query and document pairs |
|
query = "روش های ارزیابی در بازیابی اطلاعات چیست؟" # "What are the evaluation methods in information retrieval?" |
|
documents = [ |
|
"بازیابی اطلاعات یک فرآیند پیچیده است که شامل شاخص گذاری و جستجوی اسناد می شود. ارزیابی آن اغلب با معیارهایی مانند دقت و بازیابی انجام می شود.", # "Information retrieval is a complex process involving indexing and searching documents. Its evaluation is often done with metrics like precision and recall." |
|
"یادگیری عمیق در سال های اخیر پیشرفت های چشمگیری در پردازش زبان طبیعی داشته است.", # "Deep learning has made significant progress in natural language processing in recent years." |
|
"این مقاله به بررسی روش های جدید برای ارزیابی سیستم های بازیابی اطلاعات معنایی می پردازد و معیارهای نوینی را معرفی می کند." # "This paper examines new methods for evaluating semantic information retrieval systems and introduces novel metrics." |
|
] |
|
|
|
# Create pairs for scoring |
|
sentence_pairs = [[query, doc] for doc in documents] |
|
|
|
# Predict the scores |
|
scores = model.predict(sentence_pairs, convert_to_numpy=True) |
|
|
|
# Print results |
|
for score, doc in zip(scores, documents):
    print(f"Score: {score:.4f}\t Document: {doc}")
|
|
|
# Expected output (scores will vary, but should follow this trend;
# English translations shown here for readability):
# Score: 0.9123    Document: This paper examines new methods for evaluating semantic information retrieval systems and introduces novel metrics.
# Score: 0.7543    Document: Information retrieval is a complex process involving indexing and searching documents. Its evaluation is often done with metrics like precision and recall.
# Score: 0.0123    Document: Deep learning has made significant progress in natural language processing in recent years.
```
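For the two-stage setup described under Intended Use, the sketch below pairs this model with a first-stage bi-encoder; it continues from the `query`, `documents`, and `model_name` variables defined above. The bi-encoder choice and the `top_k` value are illustrative assumptions, not part of the released pipeline.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# First stage: a fast multilingual bi-encoder retrieves candidates
# (model choice is an assumption; any Persian-capable bi-encoder works).
bi_encoder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
# Second stage: this cross-encoder re-scores the retrieved candidates.
cross_encoder = CrossEncoder(model_name)  # the reranker loaded above

corpus = documents  # your document collection; here, the abstracts from above
corpus_embeddings = bi_encoder.encode(corpus, convert_to_tensor=True)
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)

# Retrieve the top candidates by cosine similarity (top_k is illustrative).
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]

# Re-score the candidates with the cross-encoder and sort by score.
pairs = [[query, corpus[hit["corpus_id"]]] for hit in hits]
rerank_scores = cross_encoder.predict(pairs)
for score, (_, doc) in sorted(zip(rerank_scores, pairs), key=lambda x: x[0], reverse=True):
    print(f"{score:.4f}\t{doc}")
```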
|
## Training Data

This model was fine-tuned on the PersianSciQA dataset.

- **Description:** PersianSciQA is a large-scale dataset containing 39,809 Persian scientific question-answer pairs. It was generated using a two-stage process with `gpt-4o-mini` on a corpus of scientific abstracts from IranDoc's 'Ganj' repository.
- **Content:** Questions paired with scientific abstracts, primarily from engineering fields.
- **Labels:** Each pair has a relevance score from 0 (Not Relevant) to 3 (Highly Relevant), which was normalized to a 0-1 float for training.
|
|
|
|
|
## Training Procedure

The model was trained using the provided `train_reranker.py` script with the following configuration (a minimal sketch of an equivalent setup follows the list):

- **Epochs:** 2
- **Batch Size:** 16
- **Learning Rate:** 2e-5
- **Loss Function:** MSELoss (regression on the normalized 0-1 labels)
- **Evaluator:** `CECorrelationEvaluator`, used to save the best model based on Spearman's rank correlation on the validation set
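The `train_reranker.py` script itself is not reproduced here; the following is a minimal sketch of an equivalent setup using the classic `sentence-transformers` CrossEncoder `fit` API. The inline samples, `max_length`, `warmup_steps`, and output path are assumptions, not values taken from the original script.

```python
import torch
from torch.utils.data import DataLoader
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder
from sentence_transformers.cross_encoder.evaluation import CECorrelationEvaluator

# Each sample pairs a question with an abstract; the 0-3 relevance
# labels are normalized to 0-1 floats (label = raw_score / 3.0).
# These inline samples are stand-ins for the PersianSciQA splits.
train_samples = [
    InputExample(texts=["سوال نمونه", "چکیده مرتبط"], label=3 / 3.0),
    InputExample(texts=["سوال نمونه", "چکیده نامرتبط"], label=0 / 3.0),
]
dev_samples = train_samples  # stand-in; use the real validation split

model = CrossEncoder("xlm-roberta-large", num_labels=1, max_length=512)  # max_length is an assumption
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=16)
evaluator = CECorrelationEvaluator.from_input_examples(dev_samples, name="dev")

model.fit(
    train_dataloader=train_dataloader,
    evaluator=evaluator,
    epochs=2,
    loss_fct=torch.nn.MSELoss(),  # regression on the normalized labels
    optimizer_params={"lr": 2e-5},
    warmup_steps=100,  # assumption; not stated above
    output_path="output/reranker-xlm-roberta-large",  # best model by Spearman saved here
)
```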
|
|
|
## Evaluation

The PersianSciQA paper reports substantial agreement between the LLM-assigned labels used for training and human expert judgments (Cohen's Kappa of 0.6642). The human validation study confirmed the high quality of the generated questions (88.60% acceptable) and of the relevance assessments.
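To check the reranker against your own labeled (question, abstract) pairs, the same `CECorrelationEvaluator` used during training can be run directly. A minimal sketch follows; the pairs and gold labels below are stand-ins, and the model name is the placeholder used earlier.

```python
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder
from sentence_transformers.cross_encoder.evaluation import CECorrelationEvaluator

model = CrossEncoder("YOUR_HF_USERNAME/reranker-xlm-roberta-large")  # placeholder name

# Stand-in test pairs: gold 0-3 relevance scores normalized to 0-1
test_samples = [
    InputExample(texts=["سوال نمونه", "چکیده مرتبط"], label=3 / 3.0),
    InputExample(texts=["سوال نمونه", "چکیده تا حدی مرتبط"], label=1 / 3.0),
    InputExample(texts=["سوال نمونه", "چکیده نامرتبط"], label=0 / 3.0),
]

evaluator = CECorrelationEvaluator.from_input_examples(test_samples, name="test")
spearman = evaluator(model)  # returns Spearman's rank correlation
print(f"Spearman: {spearman:.4f}")
```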
|
|
|
|
|
|
|
## Citation

If you use this model or the PersianSciQA dataset in your research, please cite the original paper.

(Note: the paper is a pre-print; please update the citation information once it is officially published.)

```bibtex
@inproceedings{PersianSciQA2025,
  title     = {PersianSciQA: A new Dataset for Bridging the Language Gap in Scientific Question Answering},
  author    = {Anonymous},
  year      = {2025},
  booktitle = {Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP)},
  note      = {Confidential review copy. To be updated upon publication.}
}
```