Initial model upload


Files changed (9)

.gitignore +1 -0
README.md +184 -0
config.json +60 -0
model.safetensors +3 -0
rng_state.pth +0 -0
scheduler.pt +0 -0
special_tokens_map.json +37 -0
tokenizer.json +0 -0
tokenizer_config.json +80 -0

.gitignore ADDED

	@@ -0,0 +1 @@


+optimizer.pt

README.md ADDED

	@@ -0,0 +1,184 @@

+# Mizan-Rerank-v1
+A revolutionary open-source model for reranking long Arabic texts with exceptional efficiency and accuracy.
+![Hugging Face](https://img.shields.io/badge/Hugging%20Face-Mizan--Rerank--v1-blue)
+![Model Size](https://img.shields.io/badge/Parameters-149M-green)
+![License](https://img.shields.io/badge/License-Open%20Source-brightgreen)
+## Overview
+Mizan-Rerank-v1 is an open-source model based on the ModernBERT architecture, designed specifically for reranking search results over Arabic texts. With 149 million parameters, it matches or slightly exceeds larger rerankers on the benchmarks below while using less memory and responding faster.
+## Key Features
+- **Lightweight & Efficient**: 149M parameters vs competitors with 278-568M parameters
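+- **Long Text Processing**: Handles inputs of up to 8192 tokens, using ModernBERT's alternating local (sliding-window) and global attention
+- **High-Speed Inference**: Up to 3x faster than the compared rerankers (0.1 s vs 0.2-0.3 s response time)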
+- **Arabic Language Optimization**: Specifically fine-tuned for Arabic language nuances
+- **Resource Efficient**: Up to 75% less memory than the compared rerankers (1 GB vs 2.5-4 GB)
+## Performance Benchmarks
+### Hardware Performance (RTX 4090 24GB)
+| Model | RAM Usage | Response Time |
+|-------|-----------|---------------|
+| **Mizan-Rerank-v1** | **1 GB** | **0.1 seconds** |
+| bge-reranker-v2-m3 | 4 GB | 0.3 seconds |
+| jina-reranker-v2-base-multilingual | 2.5 GB | 0.2 seconds |
+### MIRACL Dataset Results (NDCG@10)
+| Model | Score |
+|-------|-------|
+| **Mizan-Rerank-v1** | **0.8865** |
+| bge-reranker-v2-m3 | 0.8863 |
+| jina-reranker-v2-base-multilingual | 0.8481 |
+| Namaa-ARA-Reranker-V1 | 0.7941 |
+| Namaa-Reranker-v1 | 0.7176 |
+| ms-marco-MiniLM-L12-v2 | 0.1750 |
+### Reranking and Triplet Datasets (NDCG@10)
+| Model | Reranking Dataset | Triplet Dataset |
+|-------|-------------------|----------------|
+| **Mizan-Rerank-v1** | **1.0000** | **1.0000** |
+| bge-reranker-v2-m3 | 1.0000 | 0.9998 |
+| jina-reranker-v2-base-multilingual | 1.0000 | 1.0000 |
+| Namaa-ARA-Reranker-V1 | 1.0000 | 0.9989 |
+| Namaa-Reranker-v1 | 1.0000 | 0.9994 |
+| ms-marco-MiniLM-L12-v2 | 0.8906 | 0.9087 |
+## Training Methodology
+Mizan-Rerank-v1 was trained on a diverse corpus of **741,159,981 tokens** from:
+- Authentic Arabic open-source content
+- Manually processed text collections
+- Purpose-generated synthetic data
+This comprehensive training approach enables deep understanding of Arabic linguistic contexts.
+## How It Works
+1. **Query reception**: The model receives a user query and candidate texts
+2. **Content analysis**: Analyzes semantic relationships between query and each text
+3. **Relevance scoring**: Assigns a relevance score to each text
+4. **Reranking**: Sorts results by descending relevance score
+## Usage Examples
+```python
+import torch
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+# Load model and tokenizer
+model = AutoModelForSequenceClassification.from_pretrained("ALJIACHI/Mizan-Rerank-v1")
+tokenizer = AutoTokenizer.from_pretrained("ALJIACHI/Mizan-Rerank-v1")
+model.eval()
+# Function to calculate relevance score
+def get_relevance_score(query, passage):
+    inputs = tokenizer(query, passage, return_tensors="pt", padding=True, truncation=True, max_length=8192)
+    # Inference only, so skip gradient tracking
+    with torch.no_grad():
+        outputs = model(**inputs)
+    # The model emits a single logit; sigmoid (the activation listed in config.json) maps it to a 0-1 score
+    return torch.sigmoid(outputs.logits).item()
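+# Example usage (inputs stay in Arabic; English glosses are given in the comments)
+# Query: "What is the interpretation of the verse 'And We made from water every living thing'?"
+query = "ما هو تفسير الآية وجعلنا من الماء كل شيء حي"
+passages = [
+    # "The verse means that water is an essential element in the life of all living beings and is necessary for life to continue."
+    "تعني الآية أن الماء هو عنصر أساسي في حياة جميع الكائنات الحية، وهو ضروري لاستمرار الحياة.",
+    # "Planets containing frozen water have been discovered outside the solar system."
+    "تم اكتشاف كواكب خارج المجموعة الشمسية تحتوي على مياه متجمدة.",
+    # "The Holy Quran speaks of lightning and thunder in several different places."
+    "تحدث القرآن الكريم عن البرق والرعد في عدة مواضع مختلفة."
+]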
+# Get scores for each passage
+scores = [(passage, get_relevance_score(query, passage)) for passage in passages]
+# Rerank passages
+reranked_passages = sorted(scores, key=lambda x: x[1], reverse=True)
+# Print results
+for passage, score in reranked_passages:
+    print(f"Score: {score:.4f} | Passage: {passage}")
+```
+## Practical Examples
+### Example 1
+**Question (translated from Arabic):** How many downloads does the حقيبة المؤمن app have?
+| Text | Score |
+|------|--------|
+| حقيبة المؤمن has been downloaded more than 100 million times | **0.9951** |
+| The weather is very rainy in the city of Baghdad on Tuesday | 0.0031 |
+| Facebook announced the launch of the Instagram app | 0.0002 |
+| Mohammed and Ali are very hardworking students | 0.0002 |
+### Example 2
+**Question (translated from Arabic):** What is the new tax law in 2024?
+| Text | Score |
+|------|--------|
+| The official gazette published a new law in 2024 that raises taxes on large companies by 5% | **0.9989** |
+| Taxes are an important source of national income, and their rates vary from country to country. | 0.0001 |
+| The government inaugurated a new renewable energy project in 2024. | 0.0001 |
+### Example 3
+**Question (translated from Arabic):** What is the interpretation of the verse "And We made from water every living thing"?
+| Text | Score |
+|------|--------|
+| The verse means that water is an essential element in the life of all living beings and is necessary for life to continue. | **0.9996** |
+| Planets containing frozen water have been discovered outside the solar system. | 0.0000 |
+| The Holy Quran speaks of lightning and thunder in several different places. | 0.0000 |
+### Example 4
+**Question (translated from Arabic):** What are the benefits of vitamin D?
+| Text | Score |
+|------|--------|
+| Vitamin D helps strengthen bone health and the immune system, and plays an important role in calcium absorption. | **0.9991** |
+| Vitamin D is used in some food industries as a preservative. | 0.9941 |
+| Vitamin D can be obtained through exposure to sunlight or by taking dietary supplements. | 0.9938 |
+## Applications
+Mizan-Rerank-v1 opens new horizons for Arabic NLP applications:
+- Specialized Arabic search engines
+- Archiving systems and digital libraries
+- Conversational AI applications
+- E-learning platforms
+- Information retrieval systems
+## Citation
+If you use Mizan-Rerank-v1 in your research, please cite:
+```bibtex
+@software{Mizan_Rerank_v1_2023,
+  author = {Ali Aljiachi},
+  title = {Mizan-Rerank-v1: A Revolutionary Arabic Text Reranking Model},
+  year = {2023},
+  publisher = {Hugging Face},
+  url = {https://huggingface.co/Mizan/Mizan-Rerank-v1}
+}
+@misc{modernbert,
+      title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
+      author={Benjamin Warner and Antoine Chaffin and Benjamin Clavié and Orion Weller and Oskar Hallström and Said Taghadouini and Alexis Gallagher and Raja Biswas and Faisal Ladhak and Tom Aarsen and Nathan Cooper and Griffin Adams and Jeremy Howard and Iacopo Poli},
+      year={2024},
+      eprint={2412.13663},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2412.13663},
+}
+```
+## License
+We release the Mizan-Rerank model weights under the Apache 2.0 license.
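The README's benchmark tables report NDCG@10 without defining it. As a rough illustration only, the sketch below computes NDCG@10 for a single ranked list, assuming linear gain and a log2 rank discount. The README does not say which NDCG variant or evaluation tool produced its numbers, and the function names `dcg_at_k` and `ndcg_at_k` are hypothetical.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k items (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels listed in the order the reranker returned the passages.
perfect_ranking = [1, 0, 0]
relevant_in_second_place = [0, 1, 0]

print(ndcg_at_k(perfect_ranking))
print(ndcg_at_k(relevant_in_second_place))
```

A value of 1.0 means the relevant passages already sit in the best possible order; pushing a relevant passage further down the list lowers the score.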

config.json ADDED

	@@ -0,0 +1,60 @@

+{
+  "additional_special_tokens_ids": [],
+  "architectures": [
+    "ModernBertForSequenceClassification"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "bos_token_id": null,
+  "classifier_activation": "gelu",
+  "classifier_bias": false,
+  "classifier_dropout": 0.0,
+  "classifier_pooling": "mean",
+  "cls_token_id": 3,
+  "decoder_bias": true,
+  "deterministic_flash_attn": false,
+  "embedding_dropout": 0.0,
+  "eos_token_id": null,
+  "global_attn_every_n_layers": 3,
+  "global_rope_theta": 160000.0,
+  "gradient_checkpointing": false,
+  "hidden_activation": "gelu",
+  "hidden_size": 768,
+  "id2label": {
+    "0": "LABEL_0"
+  },
+  "initializer_cutoff_factor": 2.0,
+  "initializer_range": 0.02,
+  "intermediate_size": 1152,
+  "label2id": {
+    "LABEL_0": 0
+  },
+  "layer_norm_eps": 1e-05,
+  "local_attention": 128,
+  "local_rope_theta": 10000.0,
+  "mask_token_id": 6,
+  "max_position_embeddings": 8192,
+  "mlp_bias": false,
+  "mlp_dropout": 0.0,
+  "model_type": "modernbert",
+  "norm_bias": false,
+  "norm_eps": 1e-05,
+  "num_attention_heads": 12,
+  "num_hidden_layers": 22,
+  "pad_token_id": 5,
+  "position_embedding_type": "absolute",
+  "reference_compile": false,
+  "repad_logits_with_grad": false,
+  "sentence_transformers": {
+    "activation_fn": "torch.nn.modules.activation.Sigmoid",
+    "version": "4.0.1"
+  },
+  "sep_token_id": 4,
+  "sparse_pred_ignore_index": -100,
+  "sparse_prediction": false,
+  "tokenizer_class": "PreTrainedTokenizerFast",
+  "torch_dtype": "float32",
+  "transformers_version": "4.50.3",
+  "unk_token_id": 2,
+  "vocab_size": 50280
+}
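The `global_attn_every_n_layers`, `local_attention`, and `num_hidden_layers` values above determine which layers attend globally and which use a local window. The sketch below lays that pattern out, assuming the convention where layer `i` is global when `i % global_attn_every_n_layers == 0`; that convention is an assumption not stated in this config, and `attention_kind` is a hypothetical helper.

```python
# Values copied from config.json above.
num_hidden_layers = 22
global_attn_every_n_layers = 3
local_attention = 128  # local window size in tokens


def attention_kind(layer_idx, every_n=global_attn_every_n_layers):
    """Assumed convention: every n-th layer, starting at layer 0, attends globally."""
    return "global" if layer_idx % every_n == 0 else "local"


layout = [attention_kind(i) for i in range(num_hidden_layers)]
global_layers = [i for i, kind in enumerate(layout) if kind == "global"]

print("global layers:", global_layers)
print("local layers:", layout.count("local"), f"(window of {local_attention} tokens)")
```

Under that assumption, roughly one layer in three sees the full 8192-token context while the rest only look at nearby tokens, which is what keeps long inputs affordable.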

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:88ffdde2887902ea4c18a6fed3c9d608856c32804a662f1e23df2bc8c05db769
+size 598166372
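As a quick sanity check on the README's 149M-parameter figure, the pointer's `size` can be divided by the number of bytes per weight. This is only an approximation: it assumes every tensor is stored as float32 (per `torch_dtype` in config.json) and ignores the small safetensors header.

```python
# Size in bytes, copied from the Git LFS pointer above.
size_bytes = 598_166_372

# Assumption: all weights are float32, i.e. 4 bytes each.
bytes_per_param = 4

approx_params = size_bytes // bytes_per_param
print(f"~{approx_params / 1e6:.1f}M parameters")
```

The estimate lands just under 150 million, which is consistent with the parameter count quoted in the README.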

rng_state.pth ADDED

Binary file (14.2 kB).

scheduler.pt ADDED

Binary file (1.06 kB).

special_tokens_map.json ADDED

	@@ -0,0 +1,37 @@

+{
+  "cls_token": {
+    "content": "[CLS]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "[MASK]",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "[PAD]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "[SEP]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "[UNK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render.

tokenizer_config.json ADDED

	@@ -0,0 +1,80 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<|padding|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "5": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "6": {
+      "content": "[MASK]",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "extra_special_tokens": {},
+  "mask_token": "[MASK]",
+  "max_length": 512,
+  "model_input_names": [
+    "input_ids",
+    "attention_mask"
+  ],
+  "model_max_length": 8192,
+  "pad_to_multiple_of": null,
+  "pad_token": "[PAD]",
+  "pad_token_type_id": 0,
+  "padding_side": "right",
+  "sep_token": "[SEP]",
+  "stride": 0,
+  "tokenizer_class": "PreTrainedTokenizer",
+  "truncation_side": "right",
+  "truncation_strategy": "longest_first",
+  "unk_token": "[UNK]"
+}
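The `truncation_strategy` of `longest_first` together with `truncation_side` of `right` suggests how an over-long query/passage pair gets shortened. The sketch below illustrates the idea under two assumptions: that three special tokens are reserved for the pair (tokenizer.json is not rendered here, so the template is not confirmed), and that ties are trimmed from the passage. `truncate_longest_first` is a hypothetical helper, not the tokenizer's own code.

```python
def truncate_longest_first(query_ids, passage_ids, max_length, num_special_tokens=3):
    """Drop tokens from the right of whichever sequence is longer until the pair fits."""
    query_ids, passage_ids = list(query_ids), list(passage_ids)
    # Room left for real tokens once the assumed special tokens are counted.
    budget = max(max_length - num_special_tokens, 0)
    while len(query_ids) + len(passage_ids) > budget:
        if len(query_ids) > len(passage_ids):
            query_ids.pop()    # trim the right end of the query
        else:
            passage_ids.pop()  # trim the right end of the passage (also on ties)
    return query_ids, passage_ids


# Toy token ids: a 5-token query and a 20-token passage, limited to 13 tokens in total.
query, passage = truncate_longest_first(range(5), range(20), max_length=13)
print(len(query), len(passage))
```

With a short query and a long passage, only the tail of the passage is cut, so the query usually survives intact.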