---
language:
- en
tags:
- ColBERT
- PyLate
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:443147
- loss:Distillation
base_model: artiwise-ai/modernbert-base-tr-uncased
datasets:
- Speedsy/msmarco-cleaned-gemini-bge-tr-uncased
pipeline_tag: sentence-similarity
library_name: PyLate
metrics:
- MaxSim_accuracy@1
- MaxSim_accuracy@3
- MaxSim_accuracy@5
- MaxSim_accuracy@10
- MaxSim_precision@1
- MaxSim_precision@3
- MaxSim_precision@5
- MaxSim_precision@10
- MaxSim_recall@1
- MaxSim_recall@3
- MaxSim_recall@5
- MaxSim_recall@10
- MaxSim_ndcg@10
- MaxSim_mrr@10
- MaxSim_map@100
model-index:
- name: PyLate model based on artiwise-ai/modernbert-base-tr-uncased
results:
- task:
type: py-late-information-retrieval
name: Py Late Information Retrieval
dataset:
name: NanoDBPedia
type: NanoDBPedia
metrics:
- type: MaxSim_accuracy@1
value: 0.78
name: Maxsim Accuracy@1
- type: MaxSim_accuracy@3
value: 0.92
name: Maxsim Accuracy@3
- type: MaxSim_accuracy@5
value: 0.96
name: Maxsim Accuracy@5
- type: MaxSim_accuracy@10
value: 1.0
name: Maxsim Accuracy@10
- type: MaxSim_precision@1
value: 0.78
name: Maxsim Precision@1
- type: MaxSim_precision@3
value: 0.68
name: Maxsim Precision@3
- type: MaxSim_precision@5
value: 0.596
name: Maxsim Precision@5
- type: MaxSim_precision@10
value: 0.5459999999999999
name: Maxsim Precision@10
- type: MaxSim_recall@1
value: 0.08078717061354299
name: Maxsim Recall@1
- type: MaxSim_recall@3
value: 0.1904489241619047
name: Maxsim Recall@3
- type: MaxSim_recall@5
value: 0.26256917349788084
name: Maxsim Recall@5
- type: MaxSim_recall@10
value: 0.39256681253841286
name: Maxsim Recall@10
- type: MaxSim_ndcg@10
value: 0.6694382434315426
name: Maxsim Ndcg@10
- type: MaxSim_mrr@10
value: 0.8612222222222222
name: Maxsim Mrr@10
- type: MaxSim_map@100
value: 0.5270972799616637
name: Maxsim Map@100
- task:
type: py-late-information-retrieval
name: Py Late Information Retrieval
dataset:
name: NanoFiQA2018
type: NanoFiQA2018
metrics:
- type: MaxSim_accuracy@1
value: 0.48
name: Maxsim Accuracy@1
- type: MaxSim_accuracy@3
value: 0.62
name: Maxsim Accuracy@3
- type: MaxSim_accuracy@5
value: 0.72
name: Maxsim Accuracy@5
- type: MaxSim_accuracy@10
value: 0.72
name: Maxsim Accuracy@10
- type: MaxSim_precision@1
value: 0.48
name: Maxsim Precision@1
- type: MaxSim_precision@3
value: 0.27999999999999997
name: Maxsim Precision@3
- type: MaxSim_precision@5
value: 0.21999999999999997
name: Maxsim Precision@5
- type: MaxSim_precision@10
value: 0.12999999999999998
name: Maxsim Precision@10
- type: MaxSim_recall@1
value: 0.25257936507936507
name: Maxsim Recall@1
- type: MaxSim_recall@3
value: 0.3990714285714285
name: Maxsim Recall@3
- type: MaxSim_recall@5
value: 0.510595238095238
name: Maxsim Recall@5
- type: MaxSim_recall@10
value: 0.5472063492063493
name: Maxsim Recall@10
- type: MaxSim_ndcg@10
value: 0.47985220902930087
name: Maxsim Ndcg@10
- type: MaxSim_mrr@10
value: 0.5619999999999999
name: Maxsim Mrr@10
- type: MaxSim_map@100
value: 0.41362574871825997
name: Maxsim Map@100
- task:
type: py-late-information-retrieval
name: Py Late Information Retrieval
dataset:
name: NanoHotpotQA
type: NanoHotpotQA
metrics:
- type: MaxSim_accuracy@1
value: 0.92
name: Maxsim Accuracy@1
- type: MaxSim_accuracy@3
value: 0.98
name: Maxsim Accuracy@3
- type: MaxSim_accuracy@5
value: 1.0
name: Maxsim Accuracy@5
- type: MaxSim_accuracy@10
value: 1.0
name: Maxsim Accuracy@10
- type: MaxSim_precision@1
value: 0.92
name: Maxsim Precision@1
- type: MaxSim_precision@3
value: 0.5133333333333333
name: Maxsim Precision@3
- type: MaxSim_precision@5
value: 0.33599999999999997
name: Maxsim Precision@5
- type: MaxSim_precision@10
value: 0.17
name: Maxsim Precision@10
- type: MaxSim_recall@1
value: 0.46
name: Maxsim Recall@1
- type: MaxSim_recall@3
value: 0.77
name: Maxsim Recall@3
- type: MaxSim_recall@5
value: 0.84
name: Maxsim Recall@5
- type: MaxSim_recall@10
value: 0.85
name: Maxsim Recall@10
- type: MaxSim_ndcg@10
value: 0.8340361138357484
name: Maxsim Ndcg@10
- type: MaxSim_mrr@10
value: 0.9516666666666667
name: Maxsim Mrr@10
- type: MaxSim_map@100
value: 0.7774992099056552
name: Maxsim Map@100
- task:
type: py-late-information-retrieval
name: Py Late Information Retrieval
dataset:
name: NanoMSMARCO
type: NanoMSMARCO
metrics:
- type: MaxSim_accuracy@1
value: 0.42
name: Maxsim Accuracy@1
- type: MaxSim_accuracy@3
value: 0.6
name: Maxsim Accuracy@3
- type: MaxSim_accuracy@5
value: 0.7
name: Maxsim Accuracy@5
- type: MaxSim_accuracy@10
value: 0.8
name: Maxsim Accuracy@10
- type: MaxSim_precision@1
value: 0.42
name: Maxsim Precision@1
- type: MaxSim_precision@3
value: 0.2
name: Maxsim Precision@3
- type: MaxSim_precision@5
value: 0.14
name: Maxsim Precision@5
- type: MaxSim_precision@10
value: 0.08
name: Maxsim Precision@10
- type: MaxSim_recall@1
value: 0.42
name: Maxsim Recall@1
- type: MaxSim_recall@3
value: 0.6
name: Maxsim Recall@3
- type: MaxSim_recall@5
value: 0.7
name: Maxsim Recall@5
- type: MaxSim_recall@10
value: 0.8
name: Maxsim Recall@10
- type: MaxSim_ndcg@10
value: 0.6031078965623429
name: Maxsim Ndcg@10
- type: MaxSim_mrr@10
value: 0.5408333333333333
name: Maxsim Mrr@10
- type: MaxSim_map@100
value: 0.5486820427095569
name: Maxsim Map@100
- task:
type: py-late-information-retrieval
name: Py Late Information Retrieval
dataset:
name: NanoNQ
type: NanoNQ
metrics:
- type: MaxSim_accuracy@1
value: 0.58
name: Maxsim Accuracy@1
- type: MaxSim_accuracy@3
value: 0.7
name: Maxsim Accuracy@3
- type: MaxSim_accuracy@5
value: 0.76
name: Maxsim Accuracy@5
- type: MaxSim_accuracy@10
value: 0.84
name: Maxsim Accuracy@10
- type: MaxSim_precision@1
value: 0.58
name: Maxsim Precision@1
- type: MaxSim_precision@3
value: 0.24
name: Maxsim Precision@3
- type: MaxSim_precision@5
value: 0.15600000000000003
name: Maxsim Precision@5
- type: MaxSim_precision@10
value: 0.09
name: Maxsim Precision@10
- type: MaxSim_recall@1
value: 0.57
name: Maxsim Recall@1
- type: MaxSim_recall@3
value: 0.69
name: Maxsim Recall@3
- type: MaxSim_recall@5
value: 0.73
name: Maxsim Recall@5
- type: MaxSim_recall@10
value: 0.81
name: Maxsim Recall@10
- type: MaxSim_ndcg@10
value: 0.6918755447681874
name: Maxsim Ndcg@10
- type: MaxSim_mrr@10
value: 0.6583571428571429
name: Maxsim Mrr@10
- type: MaxSim_map@100
value: 0.6540863099196654
name: Maxsim Map@100
- task:
type: py-late-information-retrieval
name: Py Late Information Retrieval
dataset:
name: NanoSCIDOCS
type: NanoSCIDOCS
metrics:
- type: MaxSim_accuracy@1
value: 0.42
name: Maxsim Accuracy@1
- type: MaxSim_accuracy@3
value: 0.62
name: Maxsim Accuracy@3
- type: MaxSim_accuracy@5
value: 0.66
name: Maxsim Accuracy@5
- type: MaxSim_accuracy@10
value: 0.78
name: Maxsim Accuracy@10
- type: MaxSim_precision@1
value: 0.42
name: Maxsim Precision@1
- type: MaxSim_precision@3
value: 0.29333333333333333
name: Maxsim Precision@3
- type: MaxSim_precision@5
value: 0.23199999999999998
name: Maxsim Precision@5
- type: MaxSim_precision@10
value: 0.158
name: Maxsim Precision@10
- type: MaxSim_recall@1
value: 0.08866666666666667
name: Maxsim Recall@1
- type: MaxSim_recall@3
value: 0.18166666666666664
name: Maxsim Recall@3
- type: MaxSim_recall@5
value: 0.2396666666666667
name: Maxsim Recall@5
- type: MaxSim_recall@10
value: 0.3246666666666666
name: Maxsim Recall@10
- type: MaxSim_ndcg@10
value: 0.3235935014165522
name: Maxsim Ndcg@10
- type: MaxSim_mrr@10
value: 0.5337777777777777
name: Maxsim Mrr@10
- type: MaxSim_map@100
value: 0.24429363290034992
name: Maxsim Map@100
- task:
type: pylate-custom-nano-beir
name: Pylate Custom Nano BEIR
dataset:
name: NanoBEIR mean
type: NanoBEIR_mean
metrics:
- type: MaxSim_accuracy@1
value: 0.6
name: Maxsim Accuracy@1
- type: MaxSim_accuracy@3
value: 0.7400000000000001
name: Maxsim Accuracy@3
- type: MaxSim_accuracy@5
value: 0.7999999999999999
name: Maxsim Accuracy@5
- type: MaxSim_accuracy@10
value: 0.8566666666666666
name: Maxsim Accuracy@10
- type: MaxSim_precision@1
value: 0.6
name: Maxsim Precision@1
- type: MaxSim_precision@3
value: 0.36777777777777776
name: Maxsim Precision@3
- type: MaxSim_precision@5
value: 0.27999999999999997
name: Maxsim Precision@5
- type: MaxSim_precision@10
value: 0.19566666666666666
name: Maxsim Precision@10
- type: MaxSim_recall@1
value: 0.31200553372659573
name: Maxsim Recall@1
- type: MaxSim_recall@3
value: 0.4718645032333333
name: Maxsim Recall@3
- type: MaxSim_recall@5
value: 0.5471385130432976
name: Maxsim Recall@5
- type: MaxSim_recall@10
value: 0.6207399714019047
name: Maxsim Recall@10
- type: MaxSim_ndcg@10
value: 0.6003172515072791
name: Maxsim Ndcg@10
- type: MaxSim_mrr@10
value: 0.6846428571428572
name: Maxsim Mrr@10
- type: MaxSim_map@100
value: 0.5275473706858586
name: Maxsim Map@100
---
# PyLate model based on artiwise-ai/modernbert-base-tr-uncased
This is a [PyLate](https://github.com/lightonai/pylate) model fine-tuned from [artiwise-ai/modernbert-base-tr-uncased](https://huggingface.co/artiwise-ai/modernbert-base-tr-uncased) on the [train](https://huggingface.co/datasets/Speedsy/msmarco-cleaned-gemini-bge-tr-uncased) dataset. It maps sentences and paragraphs to sequences of 128-dimensional dense vectors (one vector per token) and can be used for semantic textual similarity using the MaxSim operator.
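
As a rough illustration of the MaxSim operator, the sketch below scores a query against a document by taking, for each query token vector, its best dot product against any document token vector, and summing those maxima. The helper names `dot` and `maxsim` and the 2-dimensional toy vectors are assumptions made for readability; this is not PyLate's implementation.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_vectors, document_vectors):
    """Sum, over query tokens, of the best dot product against any document token."""
    return sum(
        max(dot(q, d) for d in document_vectors)
        for q in query_vectors
    )


# Toy 2-dimensional token embeddings (the model itself emits 128-dimensional vectors).
query = [[1.0, 0.0], [0.0, 1.0]]
document_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
document_b = [[0.5, 0.5]]

score_a = maxsim(query, document_a)  # each query token finds an exact match
score_b = maxsim(query, document_b)  # each query token only finds a partial match
print(score_a, score_b)
```

Here `document_a` scores higher than `document_b` because every query token has a closely matching document token.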
## Model Details
### Model Description
- **Model Type:** PyLate model
- **Base model:** [artiwise-ai/modernbert-base-tr-uncased](https://huggingface.co/artiwise-ai/modernbert-base-tr-uncased)
- **Document Length:** 180 tokens
- **Query Length:** 32 tokens
- **Output Dimensionality:** 128 dimensions
- **Similarity Function:** MaxSim
- **Training Dataset:**
- [train](https://huggingface.co/datasets/Speedsy/msmarco-cleaned-gemini-bge-tr-uncased)
- **Language:** en
### Model Sources
- **Documentation:** [PyLate Documentation](https://lightonai.github.io/pylate/)
- **Repository:** [PyLate on GitHub](https://github.com/lightonai/pylate)
- **Hugging Face:** [PyLate models on Hugging Face](https://huggingface.co/models?library=PyLate)
### Full Model Architecture
```
ColBERT(
(0): Transformer({'max_seq_length': 179, 'do_lower_case': False}) with Transformer model: ModernBertModel
(1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
```
## Usage
First install the PyLate library:
```bash
pip install -U pylate
```
### Retrieval
PyLate provides a streamlined interface to index and retrieve documents using ColBERT models. The index leverages the Voyager HNSW index to efficiently handle document embeddings and enable fast retrieval.
#### Indexing documents
First, load the ColBERT model and initialize the Voyager index, then encode and index your documents:
```python
from pylate import indexes, models, retrieve

# Placeholder: replace with this model's Hugging Face repository id or a local path
pylate_model_id = "path-or-repo-id-of-this-model"

# Step 1: Load the ColBERT model
model = models.ColBERT(
    model_name_or_path=pylate_model_id,
)

# Step 2: Initialize the Voyager index
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
    override=True,  # This overwrites the existing index, if any
)

# Step 3: Encode the documents
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]

documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,  # Set to False to indicate that these are documents, not queries
    show_progress_bar=True,
)

# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)
```
Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:
```python
# To load an index, simply instantiate it with the correct folder/name and without overriding it
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
)
```
#### Retrieving top-k documents for queries
Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries.
To do so, initialize the ColBERT retriever with the index you want to search, encode the queries, and then retrieve the top-k documents to get the ids and relevance scores of the best matches:
```python
# Step 1: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)

# Step 2: Encode the queries
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,  # Set to True to indicate that these are queries
    show_progress_bar=True,
)

# Step 3: Retrieve top-k documents
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,  # Retrieve the top 10 matches for each query
)
```
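
The retrieval output can then be post-processed. The sketch below assumes that `scores` holds one list per query, each containing dictionaries with an `id` and a `score` key, which is the shape shown in the PyLate documentation; check it against your installed version. The helper `top_ids` and the sample values are hypothetical and only illustrate the idea.

```python
def top_ids(results, k=1):
    """Return the ids of the k highest-scoring documents for each query."""
    return [
        [
            match["id"]
            for match in sorted(per_query, key=lambda m: m["score"], reverse=True)[:k]
        ]
        for per_query in results
    ]


# Made-up values standing in for the output of retriever.retrieve(...)
example_scores = [
    [{"id": "3", "score": 11.2}, {"id": "1", "score": 10.3}],
    [{"id": "1", "score": 12.0}, {"id": "2", "score": 9.8}],
]

best = top_ids(example_scores, k=1)
print(best)
```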
### Reranking
If you only want to use the ColBERT model to rerank the results of a first-stage retrieval pipeline, without building an index, you can use the `rank.rerank` function and pass it the queries and documents to rerank:
```python
from pylate import rank, models

# Placeholder: replace with this model's Hugging Face repository id or a local path
pylate_model_id = "path-or-repo-id-of-this-model"

queries = [
    "query A",
    "query B",
]

# One list of candidate documents per query
documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

# Ids aligned with the candidate documents above
documents_ids = [
    [1, 2],
    [1, 3, 2],
]

model = models.ColBERT(
    model_name_or_path=pylate_model_id,
)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```
## Evaluation
### Metrics
#### Py Late Information Retrieval
* Dataset: `['NanoDBPedia', 'NanoFiQA2018', 'NanoHotpotQA', 'NanoMSMARCO', 'NanoNQ', 'NanoSCIDOCS']`
* Evaluated with `pylate.evaluation.pylate_information_retrieval_evaluator.PyLateInformationRetrievalEvaluator`
| Metric | NanoDBPedia | NanoFiQA2018 | NanoHotpotQA | NanoMSMARCO | NanoNQ | NanoSCIDOCS |
|:--------------------|:------------|:-------------|:-------------|:------------|:-----------|:------------|
| MaxSim_accuracy@1 | 0.78 | 0.48 | 0.92 | 0.42 | 0.58 | 0.42 |
| MaxSim_accuracy@3 | 0.92 | 0.62 | 0.98 | 0.6 | 0.7 | 0.62 |
| MaxSim_accuracy@5 | 0.96 | 0.72 | 1.0 | 0.7 | 0.76 | 0.66 |
| MaxSim_accuracy@10 | 1.0 | 0.72 | 1.0 | 0.8 | 0.84 | 0.78 |
| MaxSim_precision@1 | 0.78 | 0.48 | 0.92 | 0.42 | 0.58 | 0.42 |
| MaxSim_precision@3 | 0.68 | 0.28 | 0.5133 | 0.2 | 0.24 | 0.2933 |
| MaxSim_precision@5 | 0.596 | 0.22 | 0.336 | 0.14 | 0.156 | 0.232 |
| MaxSim_precision@10 | 0.546 | 0.13 | 0.17 | 0.08 | 0.09 | 0.158 |
| MaxSim_recall@1 | 0.0808 | 0.2526 | 0.46 | 0.42 | 0.57 | 0.0887 |
| MaxSim_recall@3 | 0.1904 | 0.3991 | 0.77 | 0.6 | 0.69 | 0.1817 |
| MaxSim_recall@5 | 0.2626 | 0.5106 | 0.84 | 0.7 | 0.73 | 0.2397 |
| MaxSim_recall@10 | 0.3926 | 0.5472 | 0.85 | 0.8 | 0.81 | 0.3247 |
| **MaxSim_ndcg@10** | **0.6694** | **0.4799** | **0.834** | **0.6031** | **0.6919** | **0.3236** |
| MaxSim_mrr@10 | 0.8612 | 0.562 | 0.9517 | 0.5408 | 0.6584 | 0.5338 |
| MaxSim_map@100 | 0.5271 | 0.4136 | 0.7775 | 0.5487 | 0.6541 | 0.2443 |
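
To make the metric names concrete, here is a minimal sketch of the standard per-query definitions of accuracy@k, precision@k, recall@k and reciprocal rank; the reported figures are averages of such values over all queries in a dataset. It is assumed, not verified here, that the evaluator follows these textbook definitions. The function name `metrics_at_k` and the toy ranking are hypothetical.

```python
def metrics_at_k(ranked_ids, relevant_ids, k):
    """Standard top-k retrieval metrics for a single query."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Rank (1-based) of the first relevant document within the top k, if any
    first_hit_rank = next(
        (rank for rank, doc_id in enumerate(top_k, start=1) if doc_id in relevant_ids),
        None,
    )
    return {
        "accuracy": 1.0 if hits > 0 else 0.0,   # at least one relevant document in the top k
        "precision": hits / k,                  # share of the top k that is relevant
        "recall": hits / len(relevant_ids),     # share of relevant documents found
        "mrr": 1.0 / first_hit_rank if first_hit_rank else 0.0,
    }


ranked = ["d7", "d2", "d9", "d4", "d1"]   # system ranking, best first
relevant = {"d2", "d1", "d8"}             # ground-truth relevant documents

result = metrics_at_k(ranked, relevant, k=3)
print(result)
```

Under these definitions, the NanoMSMARCO column, where accuracy@k equals recall@k and precision@10 is one tenth of recall@10, is consistent with each query having a single relevant document.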
#### Pylate Custom Nano BEIR
* Dataset: `NanoBEIR_mean`
* Evaluated with `pylate_nano_beir_evaluator.PylateCustomNanoBEIREvaluator`
| Metric | Value |
|:--------------------|:-----------|
| MaxSim_accuracy@1 | 0.6 |
| MaxSim_accuracy@3 | 0.74 |
| MaxSim_accuracy@5 | 0.8 |
| MaxSim_accuracy@10 | 0.8567 |
| MaxSim_precision@1 | 0.6 |
| MaxSim_precision@3 | 0.3678 |
| MaxSim_precision@5 | 0.28 |
| MaxSim_precision@10 | 0.1957 |
| MaxSim_recall@1 | 0.312 |
| MaxSim_recall@3 | 0.4719 |
| MaxSim_recall@5 | 0.5471 |
| MaxSim_recall@10 | 0.6207 |
| **MaxSim_ndcg@10** | **0.6003** |
| MaxSim_mrr@10 | 0.6846 |
| MaxSim_map@100 | 0.5275 |
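
The `NanoBEIR_mean` values appear to be unweighted averages over the six datasets in the per-dataset table. As a quick check, the snippet below averages the rounded `MaxSim_ndcg@10` values from that table; the variable names are illustrative only.

```python
# Rounded MaxSim_ndcg@10 values copied from the per-dataset table
ndcg_at_10 = {
    "NanoDBPedia": 0.6694,
    "NanoFiQA2018": 0.4799,
    "NanoHotpotQA": 0.834,
    "NanoMSMARCO": 0.6031,
    "NanoNQ": 0.6919,
    "NanoSCIDOCS": 0.3236,
}

# Unweighted mean across datasets
mean_ndcg = sum(ndcg_at_10.values()) / len(ndcg_at_10)
print(round(mean_ndcg, 4))
```

The result agrees with the reported mean of 0.6003 to four decimal places.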
## Training Details
### Training Dataset
#### train
* Dataset: [train](https://huggingface.co/datasets/Speedsy/msmarco-cleaned-gemini-bge-tr-uncased) at [bd034f5](https://huggingface.co/datasets/Speedsy/msmarco-cleaned-gemini-bge-tr-uncased/tree/bd034f56291b3b7a7dcde55ab0bd933977cc233e)
* Size: 443,147 training samples
* Columns: `query_id`, `document_ids`, and `scores`
* Approximate statistics based on the first 1000 samples:
| | query_id | document_ids | scores |
|:-----|:---------|:-------------|:-------|
| type | string | list | list |
* Samples:
| query_id | document_ids | scores |
|:---------|:-------------|:-------|
| `817836` | `['2716076', '6741935', '2681109', '5562684', '3507339', ...]` | `[1.0, 0.7059561610221863, 0.21702419221401215, 0.38270196318626404, 0.20812414586544037, ...]` |
| `1045170` | `['5088671', '2953295', '8783471', '4268439', '6339935', ...]` | `[1.0, 0.6493034362792969, 0.0692221149802208, 0.17963139712810516, 0.6697239875793457, ...]` |
| `1069432` | `['3724008', '314949', '8657336', '7420456', '879004', ...]` | `[1.0, 0.3706032931804657, 0.3508036434650421, 0.2823200523853302, 0.17563475668430328, ...]` |
* Loss: `pylate.losses.distillation.Distillation`
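
Each training sample pairs a query with a list of documents and teacher relevance scores. A common way to distill such scores, sketched below, is to turn the teacher scores and the student's scores into probability distributions with a softmax and minimize the KL divergence between them. This is an assumed formulation for illustration: the actual PyLate loss may normalize or scale scores differently, and the function names here are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_scores, student_scores):
    """KL(teacher || student) between the softmax distributions of the two score lists."""
    teacher = softmax(teacher_scores)
    student = softmax(student_scores)
    return sum(t * math.log(t / s) for t, s in zip(teacher, student))


# First three teacher scores of the first sample shown above
teacher_scores = [1.0, 0.7059561610221863, 0.21702419221401215]

# Hypothetical student scores: one matching the teacher, one with the order reversed
matching_student = list(teacher_scores)
reversed_student = list(reversed(teacher_scores))

loss_match = kl_divergence(teacher_scores, matching_student)
loss_reversed = kl_divergence(teacher_scores, reversed_student)
print(loss_match, loss_reversed)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the student's ranking diverges from it.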
### Training Hyperparameters
#### Non-Default Hyperparameters
- `eval_strategy`: steps
- `gradient_accumulation_steps`: 2
- `learning_rate`: 3e-05
- `num_train_epochs`: 1
- `bf16`: True
#### All Hyperparameters