---
language: fa
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- cross-encoder
- reranker
- persian
- farsi
- xlm-roberta
- scientific-qa
datasets:
- PersianSciQA
---

# Cross-Encoder for Persian Scientific Relevance Ranking

This is a cross-encoder model based on `xlm-roberta-large`, fine-tuned for relevance ranking of Persian scientific texts. It takes a question and a document (an abstract) as input and outputs a score from 0 to 1 indicating their relevance.

This model was trained as a reranker for a Persian scientific question-answering system.

## Model Details

- **Base Model:** `xlm-roberta-large`
- **Task:** Reranking / Sentence Similarity
- **Fine-tuning Framework:** `sentence-transformers`
- **Language:** Persian (fa)

## Intended Use

The primary use of this model is as a **reranker** in a search or question-answering pipeline. Given a user's query and a list of candidate documents retrieved by a faster first-stage model (such as BM25 or a bi-encoder), this cross-encoder re-scores the top candidates to produce a more accurate final ranking.

### How to Use

First install the `sentence-transformers` library:

```bash
pip install -U sentence-transformers
```

Then load the model and score query-document pairs:

```python
from sentence_transformers import CrossEncoder

# Load the model from the Hugging Face Hub
model_name = 'YOUR_HF_USERNAME/reranker-xlm-roberta-large'  # <-- IMPORTANT: Replace with your model name!
model = CrossEncoder(model_name)

# Prepare your query and document pairs
query = "روش های ارزیابی در بازیابی اطلاعات چیست؟"  # "What are the evaluation methods in information retrieval?"
documents = [
    "بازیابی اطلاعات یک فرآیند پیچیده است که شامل شاخص گذاری و جستجوی اسناد می شود. ارزیابی آن اغلب با معیارهایی مانند دقت و بازیابی انجام می شود.",  # "Information retrieval is a complex process involving indexing and searching documents. Its evaluation is often done with metrics like precision and recall."
    "یادگیری عمیق در سال های اخیر پیشرفت های چشمگیری در پردازش زبان طبیعی داشته است.",  # "Deep learning has made significant progress in natural language processing in recent years."
    "این مقاله به بررسی روش های جدید برای ارزیابی سیستم های بازیابی اطلاعات معنایی می پردازد و معیارهای نوینی را معرفی می کند."  # "This paper examines new methods for evaluating semantic information retrieval systems and introduces novel metrics."
]

# Create pairs for scoring
sentence_pairs = [[query, doc] for doc in documents]

# Predict the scores
scores = model.predict(sentence_pairs, convert_to_numpy=True)

# Print results (in document order)
for i in range(len(scores)):
    print(f"Score: {scores[i]:.4f}\t Document: {documents[i]}")

# Expected output (exact scores will vary, but should follow this trend):
# Score: 0.7543  Document: "Information retrieval is a complex process..." (relevant)
# Score: 0.0123  Document: "Deep learning has made significant progress..." (off-topic)
# Score: 0.9123  Document: "This paper examines new methods for evaluating..." (highly relevant)
```

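In a pipeline you usually want the candidates reordered by these scores rather than printed in input order. A minimal, library-free sketch of that reranking step (the scores and document strings below are hypothetical stand-ins for real `model.predict` output):

```python
# Rerank candidates by cross-encoder score, highest first.
# Scores and documents are illustrative placeholders, not real model output.
documents = ["doc about IR evaluation", "doc about deep learning", "survey of IR metrics"]
scores = [0.7543, 0.0123, 0.9123]

# Pair each score with its document and sort descending by score
reranked = sorted(zip(scores, documents), reverse=True)

for score, doc in reranked:
    print(f"{score:.4f}\t{doc}")
# 0.9123    survey of IR metrics
# 0.7543    doc about IR evaluation
# 0.0123    doc about deep learning
```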
## Training Data

This model was fine-tuned on the **PersianSciQA** dataset.

- **Description:** PersianSciQA is a large-scale dataset containing 39,809 Persian scientific question-answer pairs. It was generated using a two-stage process with `gpt-4o-mini` on a corpus of scientific abstracts from IranDoc's 'Ganj' repository.
- **Content:** The dataset consists of questions paired with scientific abstracts, primarily from engineering fields.
- **Labels:** Each pair has a relevance score from 0 (Not Relevant) to 3 (Highly Relevant), which was normalized to a 0-1 float for training.

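The 0-3 ordinal labels map to 0-1 training targets by a simple linear rescale; a one-line sketch (the raw labels here are illustrative):

```python
# Map ordinal relevance labels (0..3) to the 0-1 float targets used in training
raw_labels = [0, 1, 2, 3]  # illustrative labels, one per question-abstract pair
targets = [label / 3.0 for label in raw_labels]
print(targets)  # [0.0, 0.3333333333333333, 0.6666666666666666, 1.0]
```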
## Training Procedure

The model was trained using the provided `train_reranker.py` script with the following configuration:

- **Epochs:** 2
- **Batch Size:** 16
- **Learning Rate:** 2e-5
- **Loss Function:** MSELoss (the sentence-transformers default for regression)
- **Evaluator:** `CECorrelationEvaluator`, used to save the best model based on Spearman's rank correlation on the validation set.

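For intuition, Spearman's rank correlation cares only about the ordering of predictions versus labels, not their absolute values. A small pure-Python sketch of the tie-free formula on toy data (not the library's implementation):

```python
def spearman(xs, ys):
    """Spearman's rank correlation for tie-free data, via the
    classic shortcut rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Predictions that order documents exactly like the labels give rho = 1.0
preds = [0.1, 0.9, 0.4, 0.7]
labels = [0.0, 1.0, 1/3, 2/3]
print(spearman(preds, labels))  # 1.0
```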
## Evaluation

The PersianSciQA paper reports substantial agreement between the LLM-assigned labels used for training and human expert judgments (Cohen's Kappa of 0.6642). The human validation study confirmed the high quality of the generated questions (88.60% acceptable) and the relevance assessments.

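For intuition, Cohen's Kappa discounts the agreement two raters would reach by chance. A self-contained sketch of the statistic on toy 0-3 relevance labels (illustrative data, unrelated to the paper's reported 0.6642):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is observed
    agreement and p_e is the agreement expected by chance."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[label] / n) * (cb[label] / n) for label in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

# Toy example: two annotators rating relevance on the 0-3 scale
llm_labels   = [3, 2, 0, 1, 3, 0, 2, 1]
human_labels = [3, 2, 0, 1, 3, 0, 1, 1]
print(round(cohens_kappa(llm_labels, human_labels), 4))  # 0.8333
```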
## Citation

If you use this model or the PersianSciQA dataset in your research, please cite the original paper.

(Note: the provided paper is a pre-print. Please update the citation information once it is officially published.)

```bibtex
@inproceedings{PersianSciQA2025,
  title={PersianSciQA: A new Dataset for Bridging the Language Gap in Scientific Question Answering},
  author={Anonymous},
  year={2025},
  booktitle={Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP)},
  note={Confidential review copy. To be updated upon publication.}
}
```