safora committed on
Commit
98ef47b
·
verified ·
1 Parent(s): c405a14

Update README.md

  1. README.md +121 -13
README.md CHANGED
@@ -1 +1,121 @@
- ---
- cross-encoder
- reranker
- persian
- farsi
- xlm-roberta
- scientific-qa
- PersianSciQA
- --
- **Base Model:** `xlm-roberta-large`
- **Task:** Reranking / Sentence Similarity
- **Fine-tuning Framework:** `sentence-transformers`
- **Language:** Persian (fa)
  "بازیابی اطلاعات یک فرآیند پیچیده است که شامل شاخص گذاری و جستجوی اسناد می شود. ارزیابی آن اغلب با معیارهایی مانند دقت و بازیابی انجام می شود.", # "Information retrieval is a complex process involving indexing and searching documents. Its evaluation is often done with metrics like precision and recall."
  "یادگیری عمیق در سال های اخیر پیشرفت های چشمگیری در پردازش زبان طبیعی داشته است.", # "Deep learning has made significant progress in natural language processing in recent years."
  "این مقاله به بررسی روش های جدید برای ارزیابی سیستم های بازیابی اطلاعات معنایی می پردازد و معیارهای نوینی را معرفی می کند." # "This paper examines new methods for evaluating semantic information retrieval systems and introduces novel metrics."
  print(f"Score: {scores[i]:.4f}\t Document: {documents[i]}")
  title={PersianSciQA: A new Dataset for Bridging the Language Gap in Scientific Question Answering},
  author={Anonymous},
  year={2025},
  booktitle={Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP)},
  note={Confidential review copy. To be updated upon publication.}
+ ---
+ language: fa
+ library_name: sentence-transformers
+ pipeline_tag: sentence-similarity
+ tags:
+ - cross-encoder
+ - reranker
+ - persian
+ - farsi
+ - xlm-roberta
+ - scientific-qa
+ datasets:
+ - PersianSciQA
+ ---
+
+ # Cross-Encoder for Persian Scientific Relevance Ranking
+
+ This is a cross-encoder model based on `xlm-roberta-large`, fine-tuned for relevance ranking of Persian scientific texts. It takes a question and a document (an abstract) as input and outputs a score from 0 to 1 indicating how relevant the document is to the question.
+
+ This model was trained as a reranker for a Persian scientific question-answering system.
+
+ ## Model Details
+
+ - **Base Model:** `xlm-roberta-large`
+ - **Task:** Reranking / Sentence Similarity
+ - **Fine-tuning Framework:** `sentence-transformers`
+ - **Language:** Persian (fa)
+
+ ## Intended Use
+
+ The primary use of this model is to act as a **reranker** in a search or question-answering pipeline. Given a user's query and a list of candidate documents retrieved by a faster first-stage model (such as BM25 or a bi-encoder), this cross-encoder re-scores the top candidates to produce a more accurate final ranking.
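
The two-stage flow described above can be sketched roughly as follows. This is a minimal illustration, not the author's pipeline: `rerank` and `toy_score` are hypothetical names, and `toy_score` is a word-overlap stand-in for the cross-encoder so that the snippet runs without downloading the model. In practice `score_fn` would presumably wrap the cross-encoder's `predict` call on `[query, document]` pairs.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[List[List[str]]], Sequence[float]],
    top_k: int = 10,
) -> List[Tuple[float, str]]:
    """Re-score first-stage candidates and return the top_k, best first."""
    if not candidates:
        return []
    # Build [query, document] pairs, the input shape a cross-encoder expects.
    pairs = [[query, doc] for doc in candidates]
    scores = score_fn(pairs)
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return [(float(score), doc) for score, doc in ranked[:top_k]]


def toy_score(pairs: List[List[str]]) -> List[float]:
    """Hypothetical stand-in scorer: fraction of query words found in the document."""
    results = []
    for query, doc in pairs:
        query_words = set(query.lower().split())
        doc_words = set(doc.lower().split())
        results.append(len(query_words & doc_words) / max(len(query_words), 1))
    return results


# Candidates as they might come back from BM25 or a bi-encoder.
candidates = [
    "deep learning for language",
    "evaluation metrics for information retrieval",
    "retrieval of documents",
]
top = rerank("information retrieval evaluation", candidates, toy_score, top_k=2)
for score, doc in top:
    print(f"{score:.4f}\t{doc}")
```

Passing the scorer in as a callable keeps the ranking logic independent of the model, so the first-stage retriever and the reranker can be swapped or tested separately.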
+
+ ### How to Use
+
+ To use the model, first install the `sentence-transformers` library:
+
+ ```bash
+ pip install -U sentence-transformers
+ ```
+
+ Then load the model and score query-document pairs:
+
+ ```python
+ from sentence_transformers import CrossEncoder
+
+ # Load the model from the Hugging Face Hub
+ model_name = 'YOUR_HF_USERNAME/reranker-xlm-roberta-large' #<-- IMPORTANT: Replace with your model name!
+ model = CrossEncoder(model_name)
+
+ # Prepare your query and document pairs
+ query = "روش های ارزیابی در بازیابی اطلاعات چیست؟" # "What are the evaluation methods in information retrieval?"
+ documents = [
+     "بازیابی اطلاعات یک فرآیند پیچیده است که شامل شاخص گذاری و جستجوی اسناد می شود. ارزیابی آن اغلب با معیارهایی مانند دقت و بازیابی انجام می شود.", # "Information retrieval is a complex process involving indexing and searching documents. Its evaluation is often done with metrics like precision and recall."
+     "یادگیری عمیق در سال های اخیر پیشرفت های چشمگیری در پردازش زبان طبیعی داشته است.", # "Deep learning has made significant progress in natural language processing in recent years."
+     "این مقاله به بررسی روش های جدید برای ارزیابی سیستم های بازیابی اطلاعات معنایی می پردازد و معیارهای نوینی را معرفی می کند." # "This paper examines new methods for evaluating semantic information retrieval systems and introduces novel metrics."
+ ]
+
+ # Create pairs for scoring
+ sentence_pairs = [[query, doc] for doc in documents]
+
+ # Predict the scores
+ scores = model.predict(sentence_pairs, convert_to_numpy=True)
+
+ # Print results, highest score first
+ for i in sorted(range(len(scores)), key=lambda j: scores[j], reverse=True):
+     print(f"Score: {scores[i]:.4f}\t Document: {documents[i]}")
+
+ # Expected output (documents shown here in English translation; scores will vary but should follow this trend):
+ # Score: 0.9123 Document: This paper examines new methods for evaluating semantic information retrieval systems and introduces novel metrics.
+ # Score: 0.7543 Document: Information retrieval is a complex process involving indexing and searching documents. Its evaluation is often done with metrics like precision and recall.
+ # Score: 0.0123 Document: Deep learning has made significant progress in natural language processing in recent years.
+ ```
+
+ ## Training Data
+
+ This model was fine-tuned on the PersianSciQA dataset.
+
+ - **Description:** PersianSciQA is a large-scale dataset containing 39,809 Persian scientific question-answer pairs. It was generated using a two-stage process with `gpt-4o-mini` on a corpus of scientific abstracts from IranDoc's 'Ganj' repository.
+ - **Content:** The dataset consists of questions paired with scientific abstracts, primarily from engineering fields.
+ - **Labels:** Each pair has a relevance score from 0 (Not Relevant) to 3 (Highly Relevant), which was normalized to a 0-1 float for training.
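
The label normalization mentioned above could look like the sketch below. The README does not state the exact mapping, so dividing by the maximum grade (3) is an assumption, and `normalize_label` and `MAX_GRADE` are hypothetical names used only for illustration.

```python
MAX_GRADE = 3  # highest label on the 0-3 relevance scale (assumed divisor)


def normalize_label(grade: int) -> float:
    """Map a graded relevance label in {0, 1, 2, 3} to a float in [0, 1]."""
    if not 0 <= grade <= MAX_GRADE:
        raise ValueError(f"grade must be between 0 and {MAX_GRADE}, got {grade}")
    return grade / MAX_GRADE


normalized = [normalize_label(g) for g in (0, 1, 2, 3)]
print(normalized)
```

Under this assumption the four grades map to evenly spaced targets, which suits a regression-style loss on a single output.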
+
+ ## Training Procedure
+
+ The model was trained using the provided `train_reranker.py` script with the following configuration:
+
+ - **Epochs:** 2
+ - **Batch Size:** 16
+ - **Learning Rate:** 2e-5
+ - **Loss Function:** MSELoss (default for regression in sentence-transformers)
+ - **Evaluator:** `CECorrelationEvaluator` was used to save the best model based on Spearman's rank correlation on the validation set.
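
To make the model-selection metric concrete, here is a plain-Python sketch of Spearman's rank correlation between gold labels and predicted scores. It illustrates the metric only and is not the code path used by `CECorrelationEvaluator`; the function names and the sample values are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5


# Hypothetical validation labels (normalized to 0-1) and model scores.
gold = [0.0, 1 / 3, 2 / 3, 1.0, 1 / 3]
predicted = [0.05, 0.40, 0.55, 0.95, 0.30]
rho = spearman(gold, predicted)
print(f"Spearman rho: {rho:.4f}")
```

Because the metric depends only on rank order, a checkpoint is rewarded for ordering documents correctly even if its raw scores are poorly calibrated, which matches how a reranker is used.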
+
+ ## Evaluation
+
+ The PersianSciQA paper reports substantial agreement between the LLM-assigned labels used for training and human expert judgments (Cohen's Kappa of 0.6642). The human validation study confirmed the high quality of the generated questions (88.60% acceptable) and the relevance assessments.
+
+ ## Citation
+
+ If you use this model or the PersianSciQA dataset in your research, please cite the original paper.
+
+ (Note: The provided paper is a pre-print. Please update the citation information once it is officially published.)
+
+ ```bibtex
+ @inproceedings{PersianSciQA2025,
+   title={PersianSciQA: A new Dataset for Bridging the Language Gap in Scientific Question Answering},
+   author={Anonymous},
+   year={2025},
+   booktitle={Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP)},
+   note={Confidential review copy. To be updated upon publication.}
+ }
+ ```