ALJIACHI committed
Commit ed53e60 · 1 parent: 90e2a32

Initial model upload

.gitignore ADDED
@@ -0,0 +1 @@
1
+ optimizer.pt
README.md ADDED
@@ -0,0 +1,184 @@
1
+ # Mizan-Rerank-v1
2
+
3
+ An open-source model for reranking long Arabic texts efficiently and accurately.
4
+
5
+ ![Hugging Face](https://img.shields.io/badge/Hugging%20Face-Mizan--Rerank--v1-blue)
6
+ ![Model Size](https://img.shields.io/badge/Parameters-149M-green)
7
+ ![License](https://img.shields.io/badge/License-Open%20Source-brightgreen)
8
+
9
+ ## Overview
10
+
11
+ Mizan-Rerank-v1 is an open-source model based on the ModernBERT architecture, designed for reranking search results over Arabic text. With 149 million parameters, it aims to balance performance and efficiency: on the benchmarks below it matches or exceeds larger models while using fewer resources.
12
+
13
+ ## Key Features
14
+
15
+ - **Lightweight & Efficient**: 149M parameters vs competitors with 278-568M parameters
16
+ - **Long Text Processing**: Handles inputs of up to 8192 tokens using a sliding-window technique
17
+ - **High-Speed Inference**: Up to 3x faster than the compared models (0.1 s vs. 0.2-0.3 s response time on an RTX 4090)
18
+ - **Arabic Language Optimization**: Fine-tuned specifically for Arabic language nuances
19
+ - **Resource Efficient**: Up to 75% less memory than the compared models (1 GB vs. 2.5-4 GB)
20
+
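The README does not describe the sliding-window technique in detail. The `config.json` added in this same commit sets `local_attention: 128` and `global_attn_every_n_layers: 3`, which in ModernBERT-style models usually means most layers attend within a local window while every third layer attends over the whole sequence. The sketch below illustrates that pattern only. It assumes that layer 0 is global and that the window is split evenly around each token; neither detail is stated in this commit, and the helper names are hypothetical.

```python
# Values copied from the config.json added in this commit.
NUM_HIDDEN_LAYERS = 22
GLOBAL_ATTN_EVERY_N_LAYERS = 3
LOCAL_ATTENTION = 128


def is_global_layer(layer_idx: int) -> bool:
    """Assumption: layer 0 is global, then every third layer after it."""
    return layer_idx % GLOBAL_ATTN_EVERY_N_LAYERS == 0


def can_attend(query_pos: int, key_pos: int, layer_idx: int) -> bool:
    """Return True if the token at query_pos may attend to the token at key_pos."""
    if is_global_layer(layer_idx):
        return True  # global layers see the whole sequence
    # Assumption: the local window is split evenly on both sides of the token.
    return abs(query_pos - key_pos) <= LOCAL_ATTENTION // 2


global_layers = [i for i in range(NUM_HIDDEN_LAYERS) if is_global_layer(i)]
print(global_layers)
```

Under these assumptions, the cost of the local layers grows linearly with sequence length rather than quadratically, which is one plausible reason an 8192-token context remains affordable.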
21
+ ## Performance Benchmarks
22
+
23
+ ### Hardware Performance (RTX 4090 24GB)
24
+
25
+ | Model | RAM Usage | Response Time |
26
+ |-------|-----------|---------------|
27
+ | **Mizan-Rerank-v1** | **1 GB** | **0.1 seconds** |
28
+ | bge-reranker-v2-m3 | 4 GB | 0.3 seconds |
29
+ | jina-reranker-v2-base-multilingual | 2.5 GB | 0.2 seconds |
30
+
31
+ ### MIRACL Dataset Results (nDCG@10)
32
+
33
+ | Model | Score |
34
+ |-------|-------|
35
+ | **Mizan-Rerank-v1** | **0.8865** |
36
+ | bge-reranker-v2-m3 | 0.8863 |
37
+ | jina-reranker-v2-base-multilingual | 0.8481 |
38
+ | Namaa-ARA-Reranker-V1 | 0.7941 |
39
+ | Namaa-Reranker-v1 | 0.7176 |
40
+ | ms-marco-MiniLM-L12-v2 | 0.1750 |
41
+
42
+ ### Reranking and Triplet Datasets (nDCG@10)
43
+
44
+ | Model | Reranking Dataset | Triplet Dataset |
45
+ |-------|-------------------|----------------|
46
+ | **Mizan-Rerank-v1** | **1.0000** | **1.0000** |
47
+ | bge-reranker-v2-m3 | 1.0000 | 0.9998 |
48
+ | jina-reranker-v2-base-multilingual | 1.0000 | 1.0000 |
49
+ | Namaa-ARA-Reranker-V1 | 1.0000 | 0.9989 |
50
+ | Namaa-Reranker-v1 | 1.0000 | 0.9994 |
51
+ | ms-marco-MiniLM-L12-v2 | 0.8906 | 0.9087 |
52
+
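The tables above report nDCG@10. For readers unfamiliar with the metric, here is a minimal sketch of how it is commonly computed for one query. It assumes linear gain (some evaluation toolkits use an exponential gain instead), and the function names are illustrative. It is not the evaluation script used to produce the numbers above.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k items, in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG normalised by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels of the passages in the order the reranker returned them.
perfect = ndcg_at_k([1, 0, 0])   # relevant passage ranked first
swapped = ndcg_at_k([0, 1, 0])   # relevant passage ranked second
print(perfect, swapped)
```

A score of 1.0 means the relevant passages appear in the best possible order within the top 10; pushing a relevant passage down the list lowers the score logarithmically.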
53
+ ## Training Methodology
54
+
55
+ Mizan-Rerank-v1 was trained on a diverse corpus of **741,159,981 tokens** from:
56
+
57
+ - Authentic Arabic open-source content
58
+ - Manually processed text collections
59
+ - Purpose-generated synthetic data
60
+
61
+ This mix of sources is intended to expose the model to a broad range of Arabic linguistic contexts.
62
+
63
+ ## How It Works
64
+
65
+ 1. **Query reception**: The model receives a user query and candidate texts
66
+ 2. **Content analysis**: Analyzes semantic relationships between query and each text
67
+ 3. **Relevance scoring**: Assigns a relevance score to each text
68
+ 4. **Reranking**: Sorts results by descending relevance score
69
+
70
+ ## Usage Examples
71
+
72
+ ```python
73
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
74
+
75
+ # Load model and tokenizer
76
+ model = AutoModelForSequenceClassification.from_pretrained("ALJIACHI/Mizan-Rerank-v1")
77
+ tokenizer = AutoTokenizer.from_pretrained("ALJIACHI/Mizan-Rerank-v1")
78
+
79
+ # Function to calculate relevance score
80
+ def get_relevance_score(query, passage):
81
+     inputs = tokenizer(query, passage, return_tensors="pt", padding=True, truncation=True, max_length=8192)
82
+     outputs = model(**inputs)
83
+     return outputs.logits.item()
84
+
85
+ # Example usage
86
+ query = "ما هو تفسير الآية وجعلنا من الماء كل شيء حي"
87
+ passages = [
88
+     "تعني الآية أن الماء هو عنصر أساسي في حياة جميع الكائنات الحية، وهو ضروري لاستمرار الحياة.",
89
+     "تم اكتشاف كواكب خارج المجموعة الشمسية تحتوي على مياه متجمدة.",
90
+     "تحدث القرآن الكريم عن البرق والرعد في عدة مواضع مختلفة."
91
+ ]
92
+
93
+ # Get scores for each passage
94
+ scores = [(passage, get_relevance_score(query, passage)) for passage in passages]
95
+
96
+ # Rerank passages
97
+ reranked_passages = sorted(scores, key=lambda x: x[1], reverse=True)
98
+
99
+ # Print results
100
+ for passage, score in reranked_passages:
101
+     print(f"Score: {score:.4f} | Passage: {passage}")
102
+ ```
103
+
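The `get_relevance_score` function above returns the raw logit, which is unbounded. The scores in the practical examples all lie between 0 and 1, and the `config.json` in this commit lists a Sigmoid activation under `sentence_transformers`, so a sigmoid is presumably applied to obtain them. The sketch below shows that mapping in plain Python. The helper names and the example logits are hypothetical and are not model output.

```python
import math


def sigmoid(logit: float) -> float:
    """Numerically stable logistic function mapping a real logit into [0, 1]."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def to_probabilities(scored_passages):
    """Convert (passage, raw_logit) pairs to (passage, score) pairs, best first."""
    converted = [(passage, sigmoid(logit)) for passage, logit in scored_passages]
    return sorted(converted, key=lambda pair: pair[1], reverse=True)


# Hypothetical raw logits, chosen only for illustration.
example = [("passage A", -6.0), ("passage B", 7.5), ("passage C", 0.0)]
ranked = to_probabilities(example)
for passage, score in ranked:
    print(f"Score: {score:.4f} | Passage: {passage}")
```

Because the sigmoid is monotonic, applying it changes the scale of the scores but not the ranking.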
104
+ ## Practical Examples
105
+
106
+ ### Example 1
107
+
108
+ **Question:** كم عدد تحميلات تطبيق حقيبة المؤمن
109
+
110
+ | Text | Score |
111
+ |------|--------|
112
+ | بلغ عدد تحميلات حقيبة المؤمن اكثر من ١٠٠ مليون تحميل | **0.9951** |
113
+ | الاجواء ماطرة جداً في مدينة بغداد يوم الثلاثاء | 0.0031 |
114
+ | اعلنت شركة فيس بوك عن اطلاق تطبيق الانستجرام | 0.0002 |
115
+ | محمد وعلي هما طلاب مجتهدين جداً في دراستهم | 0.0002 |
116
+
117
+ ### Example 2
118
+
119
+ **Question:** ما هو القانون الجديد بشأن الضرائب في 2024؟
120
+
121
+ | Text | Score |
122
+ |------|--------|
123
+ | نشرت الجريدة الرسمية قانوناً جديداً في 2024 ينص على زيادة الضرائب على الشركات الكبرى بنسبة 5% | **0.9989** |
124
+ | الضرائب تعد مصدراً مهماً للدخل القومي وتختلف نسبتها من دولة إلى أخرى. | 0.0001 |
125
+ | افتتحت الحكومة مشروعاً جديداً للطاقة المتجددة في 2024. | 0.0001 |
126
+
127
+ ### Example 3
128
+
129
+ **Question:** ما هو تفسير الآية وجعلنا من الماء كل شيء حي
130
+
131
+ | Text | Score |
132
+ |------|--------|
133
+ | تعني الآية أن الماء هو عنصر أساسي في حياة جميع الكائنات الحية، وهو ضروري لاستمرار الحياة. | **0.9996** |
134
+ | تم اكتشاف كواكب خارج المجموعة الشمسية تحتوي على مياه متجمدة. | 0.0000 |
135
+ | تحدث القرآن الكريم عن البرق والرعد في عدة مواضع مختلفة. | 0.0000 |
136
+
137
+ ### Example 4
138
+
139
+ **Question:** ما هي فوائد فيتامين د؟
140
+
141
+ | Text | Score |
142
+ |------|--------|
143
+ | يساعد فيتامين د في تعزيز صحة العظام وتقوية الجهاز المناعي، كما يلعب دوراً مهماً في امتصاص الكالسيوم. | **0.9991** |
144
+ | يستخدم فيتامين د في بعض الصناعات الغذائية كمادة حافظة. | 0.9941 |
145
+ | يمكن الحصول على فيتامين د من خلال التعرض لأشعة الشمس أو تناول مكملات غذائية. | 0.9938 |
146
+
147
+ ## Applications
148
+
149
+ Mizan-Rerank-v1 is suited to Arabic NLP applications such as:
150
+
151
+ - Specialized Arabic search engines
152
+ - Archiving systems and digital libraries
153
+ - Conversational AI applications
154
+ - E-learning platforms
155
+ - Information retrieval systems
156
+
157
+ ## Citation
158
+
159
+ If you use Mizan-Rerank-v1 in your research, please cite:
160
+
161
+ ```bibtex
162
+ @software{Mizan_Rerank_v1_2023,
163
+ author = {Ali Aljiachi},
164
+ title = {Mizan-Rerank-v1: A Revolutionary Arabic Text Reranking Model},
165
+ year = {2023},
166
+ publisher = {Hugging Face},
167
+ url = {https://huggingface.co/Mizan/Mizan-Rerank-v1}
168
+ }
169
+
170
+ @misc{modernbert,
171
+ title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
172
+ author={Benjamin Warner and Antoine Chaffin and Benjamin Clavié and Orion Weller and Oskar Hallström and Said Taghadouini and Alexis Gallagher and Raja Biswas and Faisal Ladhak and Tom Aarsen and Nathan Cooper and Griffin Adams and Jeremy Howard and Iacopo Poli},
173
+ year={2024},
174
+ eprint={2412.13663},
175
+ archivePrefix={arXiv},
176
+ primaryClass={cs.CL},
177
+ url={https://arxiv.org/abs/2412.13663},
178
+ }
179
+ ```
180
+
181
+ ## License
182
+
183
+ We release the Mizan-Rerank model weights under the Apache 2.0 license.
184
+
config.json ADDED
@@ -0,0 +1,60 @@
1
+ {
2
+ "additional_special_tokens_ids": [],
3
+ "architectures": [
4
+ "ModernBertForSequenceClassification"
5
+ ],
6
+ "attention_bias": false,
7
+ "attention_dropout": 0.0,
8
+ "bos_token_id": null,
9
+ "classifier_activation": "gelu",
10
+ "classifier_bias": false,
11
+ "classifier_dropout": 0.0,
12
+ "classifier_pooling": "mean",
13
+ "cls_token_id": 3,
14
+ "decoder_bias": true,
15
+ "deterministic_flash_attn": false,
16
+ "embedding_dropout": 0.0,
17
+ "eos_token_id": null,
18
+ "global_attn_every_n_layers": 3,
19
+ "global_rope_theta": 160000.0,
20
+ "gradient_checkpointing": false,
21
+ "hidden_activation": "gelu",
22
+ "hidden_size": 768,
23
+ "id2label": {
24
+ "0": "LABEL_0"
25
+ },
26
+ "initializer_cutoff_factor": 2.0,
27
+ "initializer_range": 0.02,
28
+ "intermediate_size": 1152,
29
+ "label2id": {
30
+ "LABEL_0": 0
31
+ },
32
+ "layer_norm_eps": 1e-05,
33
+ "local_attention": 128,
34
+ "local_rope_theta": 10000.0,
35
+ "mask_token_id": 6,
36
+ "max_position_embeddings": 8192,
37
+ "mlp_bias": false,
38
+ "mlp_dropout": 0.0,
39
+ "model_type": "modernbert",
40
+ "norm_bias": false,
41
+ "norm_eps": 1e-05,
42
+ "num_attention_heads": 12,
43
+ "num_hidden_layers": 22,
44
+ "pad_token_id": 5,
45
+ "position_embedding_type": "absolute",
46
+ "reference_compile": false,
47
+ "repad_logits_with_grad": false,
48
+ "sentence_transformers": {
49
+ "activation_fn": "torch.nn.modules.activation.Sigmoid",
50
+ "version": "4.0.1"
51
+ },
52
+ "sep_token_id": 4,
53
+ "sparse_pred_ignore_index": -100,
54
+ "sparse_prediction": false,
55
+ "tokenizer_class": "PreTrainedTokenizerFast",
56
+ "torch_dtype": "float32",
57
+ "transformers_version": "4.50.3",
58
+ "unk_token_id": 2,
59
+ "vocab_size": 50280
60
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:88ffdde2887902ea4c18a6fed3c9d608856c32804a662f1e23df2bc8c05db769
3
+ size 598166372
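As a rough cross-check of the 149M-parameter figure in the README, the file size recorded in this LFS pointer can be divided by the four bytes per weight implied by `"torch_dtype": "float32"` in `config.json`. This is a back-of-the-envelope sketch: it assumes every tensor is stored as float32, and it ignores the small safetensors header, so it slightly overestimates.

```python
# Size of model.safetensors as recorded in the LFS pointer in this commit.
SAFETENSORS_BYTES = 598_166_372
BYTES_PER_FLOAT32 = 4  # config.json lists "torch_dtype": "float32"

# Assumption: all stored tensors are float32 and the file header is negligible.
approx_params = SAFETENSORS_BYTES // BYTES_PER_FLOAT32
approx_params_millions = approx_params / 1e6
print(f"~{approx_params_millions:.1f}M parameters")  # prints ~149.5M parameters
```

The estimate agrees with the 149M figure stated in the README.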
rng_state.pth ADDED
Binary file (14.2 kB).
scheduler.pt ADDED
Binary file (1.06 kB).
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
1
+ {
2
+ "cls_token": {
3
+ "content": "[CLS]",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "mask_token": {
10
+ "content": "[MASK]",
11
+ "lstrip": true,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "[PAD]",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "sep_token": {
24
+ "content": "[SEP]",
25
+ "lstrip": false,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "unk_token": {
31
+ "content": "[UNK]",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ }
37
+ }
tokenizer.json ADDED
The diff for this file is too large to render.
tokenizer_config.json ADDED
@@ -0,0 +1,80 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "<|padding|>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "1": {
12
+ "content": "<|endoftext|>",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "2": {
20
+ "content": "[UNK]",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "3": {
28
+ "content": "[CLS]",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "4": {
36
+ "content": "[SEP]",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ },
43
+ "5": {
44
+ "content": "[PAD]",
45
+ "lstrip": false,
46
+ "normalized": false,
47
+ "rstrip": false,
48
+ "single_word": false,
49
+ "special": true
50
+ },
51
+ "6": {
52
+ "content": "[MASK]",
53
+ "lstrip": true,
54
+ "normalized": false,
55
+ "rstrip": false,
56
+ "single_word": false,
57
+ "special": true
58
+ }
59
+ },
60
+ "clean_up_tokenization_spaces": true,
61
+ "cls_token": "[CLS]",
62
+ "extra_special_tokens": {},
63
+ "mask_token": "[MASK]",
64
+ "max_length": 512,
65
+ "model_input_names": [
66
+ "input_ids",
67
+ "attention_mask"
68
+ ],
69
+ "model_max_length": 8192,
70
+ "pad_to_multiple_of": null,
71
+ "pad_token": "[PAD]",
72
+ "pad_token_type_id": 0,
73
+ "padding_side": "right",
74
+ "sep_token": "[SEP]",
75
+ "stride": 0,
76
+ "tokenizer_class": "PreTrainedTokenizer",
77
+ "truncation_side": "right",
78
+ "truncation_strategy": "longest_first",
79
+ "unk_token": "[UNK]"
80
+ }