ALJIACHI committed
Commit 27c47ce · 2 Parent(s): 480408b 8b0220c

Initial model upload
.gitattributes CHANGED
@@ -1,3 +1,4 @@
+ <<<<<<< HEAD
  *.7z filter=lfs diff=lfs merge=lfs -text
  *.arrow filter=lfs diff=lfs merge=lfs -text
  *.bin filter=lfs diff=lfs merge=lfs -text
@@ -33,3 +34,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ =======
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ >>>>>>> master
.gitignore ADDED
@@ -0,0 +1,4 @@
+ optimizer.pt
+ .gitignore
+ .gitattributes
+ scheduler.pt
README.md CHANGED
@@ -1,3 +1,190 @@
+ <<<<<<< HEAD
  ---
  license: apache-2.0
  ---
+ =======
+ # Mizan-Rerank-v1
+
+ A revolutionary open-source model for reranking Arabic long texts with exceptional efficiency and accuracy.
+
+ ![Hugging Face](https://img.shields.io/badge/Hugging%20Face-Mizan--Rerank--v1-blue)
+ ![Model Size](https://img.shields.io/badge/Parameters-149M-green)
+ ![License](https://img.shields.io/badge/License-Open%20Source-brightgreen)
+
+ ## Overview
+
+ Mizan-Rerank-v1 is a leading open-source model based on the ModernBERT architecture, specifically designed for reranking search results in Arabic texts. With only 149 million parameters, it offers a strong balance between performance and efficiency, outperforming larger models while using significantly fewer resources.
+
+ ## Key Features
+
+ - **Lightweight & Efficient**: 149M parameters vs. competitors with 278-568M parameters
+ - **Long Text Processing**: Handles up to 8192 tokens, using a sliding-window technique for longer inputs (see the sketch after this list)
+ - **High-Speed Inference**: 3x faster than comparable models
+ - **Arabic Language Optimization**: Specifically fine-tuned for Arabic language nuances
+ - **Resource Efficient**: 75% less memory consumption than competitors
+
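+ The sliding-window procedure itself is not spelled out in this card, so the following is a minimal sketch of one plausible inference-time approach: a document longer than the 8192-token window is split into overlapping chunks, each chunk is scored against the query, and the best window score is kept. `window_tokens` and `stride_tokens` are illustrative assumptions, not documented values.
+
+ ```python
+ import torch
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ model = AutoModelForSequenceClassification.from_pretrained("ALJIACHI/Mizan-Rerank-v1")
+ tokenizer = AutoTokenizer.from_pretrained("ALJIACHI/Mizan-Rerank-v1")
+ model.eval()
+
+ def score_long_document(query, document, window_tokens=8000, stride_tokens=4000):
+     # Tokenize the document once, without special tokens
+     doc_ids = tokenizer(document, add_special_tokens=False)["input_ids"]
+     best = float("-inf")
+     for start in range(0, max(len(doc_ids), 1), stride_tokens):
+         # Decode the window back to text so the tokenizer can rebuild
+         # the (query, chunk) pair with the correct special tokens
+         chunk = tokenizer.decode(doc_ids[start:start + window_tokens])
+         inputs = tokenizer(query, chunk, return_tensors="pt",
+                            truncation=True, max_length=8192)
+         with torch.no_grad():
+             best = max(best, model(**inputs).logits.item())
+         if start + window_tokens >= len(doc_ids):
+             break
+     # Sigmoid maps the best window logit to a [0, 1] relevance score
+     return torch.sigmoid(torch.tensor(best)).item()
+ ```
+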
+ ## Performance Benchmarks
+
+ ### Hardware Performance (RTX 4090 24GB)
+
+ | Model | RAM Usage | Response Time |
+ |-------|-----------|---------------|
+ | **Mizan-Rerank-v1** | **1 GB** | **0.1 seconds** |
+ | bge-reranker-v2-m3 | 4 GB | 0.3 seconds |
+ | jina-reranker-v2-base-multilingual | 2.5 GB | 0.2 seconds |
+
+ ### MIRACL Dataset Results (nDCG@10)
+
+ | Model | Score |
+ |-------|-------|
+ | **Mizan-Rerank-v1** | **0.8865** |
+ | bge-reranker-v2-m3 | 0.8863 |
+ | jina-reranker-v2-base-multilingual | 0.8481 |
+ | Namaa-ARA-Reranker-V1 | 0.7941 |
+ | Namaa-Reranker-v1 | 0.7176 |
+ | ms-marco-MiniLM-L12-v2 | 0.1750 |
+
+ ### Reranking and Triplet Datasets (nDCG@10)
+
+ | Model | Reranking Dataset | Triplet Dataset |
+ |-------|-------------------|-----------------|
+ | **Mizan-Rerank-v1** | **1.0000** | **1.0000** |
+ | bge-reranker-v2-m3 | 1.0000 | 0.9998 |
+ | jina-reranker-v2-base-multilingual | 1.0000 | 1.0000 |
+ | Namaa-ARA-Reranker-V1 | 1.0000 | 0.9989 |
+ | Namaa-Reranker-v1 | 1.0000 | 0.9994 |
+ | ms-marco-MiniLM-L12-v2 | 0.8906 | 0.9087 |
+
+ ## Training Methodology
+
+ Mizan-Rerank-v1 was trained on a diverse corpus of **741,159,981 tokens** from:
+
+ - Authentic Arabic open-source content
+ - Manually processed text collections
+ - Purpose-generated synthetic data
+
+ This comprehensive training approach enables deep understanding of Arabic linguistic contexts.
+
+ ## How It Works
+
+ 1. **Query reception**: The model receives a user query and candidate texts
+ 2. **Content analysis**: Analyzes semantic relationships between the query and each text
+ 3. **Relevance scoring**: Assigns a relevance score to each text
+ 4. **Reranking**: Sorts results by descending relevance score
+
+ ## Usage Examples
+
+ ```python
+ import torch
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ # Load model and tokenizer
+ model = AutoModelForSequenceClassification.from_pretrained("ALJIACHI/Mizan-Rerank-v1")
+ tokenizer = AutoTokenizer.from_pretrained("ALJIACHI/Mizan-Rerank-v1")
+ model.eval()
+
+ # Function to calculate relevance score
+ def get_relevance_score(query, passage):
+     inputs = tokenizer(query, passage, return_tensors="pt", padding=True,
+                        truncation=True, max_length=8192)
+     with torch.no_grad():
+         outputs = model(**inputs)
+     # Sigmoid maps the raw logit to a [0, 1] relevance score, matching the
+     # Sigmoid activation declared in config.json
+     return torch.sigmoid(outputs.logits).item()
+
+ # Example usage
+ query = "ما هو تفسير الآية وجعلنا من الماء كل شيء حي"
+ passages = [
+     "تعني الآية أن الماء هو عنصر أساسي في حياة جميع الكائنات الحية، وهو ضروري لاستمرار الحياة.",
+     "تم اكتشاف كواكب خارج المجموعة الشمسية تحتوي على مياه متجمدة.",
+     "تحدث القرآن الكريم عن البرق والرعد في عدة مواضع مختلفة."
+ ]
+
+ # Get scores for each passage
+ scores = [(passage, get_relevance_score(query, passage)) for passage in passages]
+
+ # Rerank passages by descending relevance
+ reranked_passages = sorted(scores, key=lambda x: x[1], reverse=True)
+
+ # Print results
+ for passage, score in reranked_passages:
+     print(f"Score: {score:.4f} | Passage: {passage}")
+ ```
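+
+ Since config.json (included in this upload) carries a `sentence_transformers` block (Sigmoid activation, version 4.0.1), the model can presumably also be driven through the sentence-transformers `CrossEncoder` API. A minimal sketch, assuming the `sentence-transformers` package is installed; the query and passages are taken from Example 4 below:
+
+ ```python
+ from sentence_transformers import CrossEncoder
+
+ # max_length matches the model's 8192-token context window
+ reranker = CrossEncoder("ALJIACHI/Mizan-Rerank-v1", max_length=8192)
+
+ query = "ما هي فوائد فيتامين د؟"
+ passages = [
+     "يساعد فيتامين د في تعزيز صحة العظام وتقوية الجهاز المناعي، كما يلعب دوراً مهماً في امتصاص الكالسيوم.",
+     "يمكن الحصول على فيتامين د من خلال التعرض لأشعة الشمس أو تناول مكملات غذائية.",
+ ]
+
+ # predict() scores each (query, passage) pair in a single batch
+ scores = reranker.predict([(query, p) for p in passages])
+ for passage, score in sorted(zip(passages, scores), key=lambda x: x[1], reverse=True):
+     print(f"Score: {score:.4f} | Passage: {passage}")
+ ```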
+
+ ## Practical Examples
+
+ ### Example 1
+
+ **Query:** كم عدد تحميلات تطبيق حقيبة المؤمن (How many downloads does the Haqibat Al-Mu'min app have?)
+
+ | Passage | Score |
+ |---------|-------|
+ | بلغ عدد تحميلات حقيبة المؤمن اكثر من ١٠٠ مليون تحميل | **0.9951** |
+ | الاجواء ماطرة جداً في مدينة بغداد يوم الثلاثاء | 0.0031 |
+ | اعلنت شركة فيس بوك عن اطلاق تطبيق الانستجرام | 0.0002 |
+ | محمد وعلي هما طلاب مجتهدين جداً في دراستهم | 0.0002 |
+
+ ### Example 2
+
+ **Query:** ما هو القانون الجديد بشأن الضرائب في 2024؟ (What is the new law on taxes in 2024?)
+
+ | Passage | Score |
+ |---------|-------|
+ | نشرت الجريدة الرسمية قانوناً جديداً في 2024 ينص على زيادة الضرائب على الشركات الكبرى بنسبة 5% | **0.9989** |
+ | الضرائب تعد مصدراً مهماً للدخل القومي وتختلف نسبتها من دولة إلى أخرى. | 0.0001 |
+ | افتتحت الحكومة مشروعاً جديداً للطاقة المتجددة في 2024. | 0.0001 |
+
+ ### Example 3
+
+ **Query:** ما هو تفسير الآية وجعلنا من الماء كل شيء حي (What is the interpretation of the verse "And We made from water every living thing"?)
+
+ | Passage | Score |
+ |---------|-------|
+ | تعني الآية أن الماء هو عنصر أساسي في حياة جميع الكائنات الحية، وهو ضروري لاستمرار الحياة. | **0.9996** |
+ | تم اكتشاف كواكب خارج المجموعة الشمسية تحتوي على مياه متجمدة. | 0.0000 |
+ | تحدث القرآن الكريم عن البرق والرعد في عدة مواضع مختلفة. | 0.0000 |
+
+ ### Example 4
+
+ **Query:** ما هي فوائد فيتامين د؟ (What are the benefits of vitamin D?)
+
+ | Passage | Score |
+ |---------|-------|
+ | يساعد فيتامين د في تعزيز صحة العظام وتقوية الجهاز المناعي، كما يلعب دوراً مهماً في امتصاص الكالسيوم. | **0.9991** |
+ | يستخدم فيتامين د في بعض الصناعات الغذائية كمادة حافظة. | 0.9941 |
+ | يمكن الحصول على فيتامين د من خلال التعرض لأشعة الشمس أو تناول مكملات غذائية. | 0.9938 |
+
+ ## Applications
+
+ Mizan-Rerank-v1 opens new horizons for Arabic NLP applications:
+
+ - Specialized Arabic search engines
+ - Archiving systems and digital libraries
+ - Conversational AI applications
+ - E-learning platforms
+ - Information retrieval systems
+
+ ## Citation
+
+ If you use Mizan-Rerank-v1 in your research, please cite:
+
+ ```bibtex
+ @software{Mizan_Rerank_v1_2023,
+   author = {Ali Aljiachi},
+   title = {Mizan-Rerank-v1: A Revolutionary Arabic Text Reranking Model},
+   year = {2023},
+   publisher = {Hugging Face},
+   url = {https://huggingface.co/Mizan/Mizan-Rerank-v1}
+ }
+
+ @misc{modernbert,
+   title = {Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
+   author = {Benjamin Warner and Antoine Chaffin and Benjamin Clavié and Orion Weller and Oskar Hallström and Said Taghadouini and Alexis Gallagher and Raja Biswas and Faisal Ladhak and Tom Aarsen and Nathan Cooper and Griffin Adams and Jeremy Howard and Iacopo Poli},
+   year = {2024},
+   eprint = {2412.13663},
+   archivePrefix = {arXiv},
+   primaryClass = {cs.CL},
+   url = {https://arxiv.org/abs/2412.13663}
+ }
+ ```
+
+ ## License
+
+ We release the Mizan-Rerank model weights under the Apache 2.0 license.
+
+ >>>>>>> master
config.json ADDED
@@ -0,0 +1,60 @@
+ {
+   "additional_special_tokens_ids": [],
+   "architectures": [
+     "ModernBertForSequenceClassification"
+   ],
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "bos_token_id": null,
+   "classifier_activation": "gelu",
+   "classifier_bias": false,
+   "classifier_dropout": 0.0,
+   "classifier_pooling": "mean",
+   "cls_token_id": 3,
+   "decoder_bias": true,
+   "deterministic_flash_attn": false,
+   "embedding_dropout": 0.0,
+   "eos_token_id": null,
+   "global_attn_every_n_layers": 3,
+   "global_rope_theta": 160000.0,
+   "gradient_checkpointing": false,
+   "hidden_activation": "gelu",
+   "hidden_size": 768,
+   "id2label": {
+     "0": "LABEL_0"
+   },
+   "initializer_cutoff_factor": 2.0,
+   "initializer_range": 0.02,
+   "intermediate_size": 1152,
+   "label2id": {
+     "LABEL_0": 0
+   },
+   "layer_norm_eps": 1e-05,
+   "local_attention": 128,
+   "local_rope_theta": 10000.0,
+   "mask_token_id": 6,
+   "max_position_embeddings": 8192,
+   "mlp_bias": false,
+   "mlp_dropout": 0.0,
+   "model_type": "modernbert",
+   "norm_bias": false,
+   "norm_eps": 1e-05,
+   "num_attention_heads": 12,
+   "num_hidden_layers": 22,
+   "pad_token_id": 5,
+   "position_embedding_type": "absolute",
+   "reference_compile": false,
+   "repad_logits_with_grad": false,
+   "sentence_transformers": {
+     "activation_fn": "torch.nn.modules.activation.Sigmoid",
+     "version": "4.0.1"
+   },
+   "sep_token_id": 4,
+   "sparse_pred_ignore_index": -100,
+   "sparse_prediction": false,
+   "tokenizer_class": "PreTrainedTokenizerFast",
+   "torch_dtype": "float32",
+   "transformers_version": "4.50.3",
+   "unk_token_id": 2,
+   "vocab_size": 50280
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:88ffdde2887902ea4c18a6fed3c9d608856c32804a662f1e23df2bc8c05db769
+ size 598166372
rng_state.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:93e33002876a7677abe5796d70d473302d9bde216013e7f87665b96a2fbad655
+ size 14244
scheduler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a68b72e2d29aba381128697445fce4b3338c1387fabe319bf86a5d67fb8671af
+ size 1064
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,80 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<|padding|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "5": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "6": {
+       "content": "[MASK]",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "max_length": 512,
+   "model_input_names": [
+     "input_ids",
+     "attention_mask"
+   ],
+   "model_max_length": 8192,
+   "pad_to_multiple_of": null,
+   "pad_token": "[PAD]",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "[SEP]",
+   "stride": 0,
+   "tokenizer_class": "PreTrainedTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }