PyLate model based on Speedsy/turkish-multilingual-e5-small-32768

This is a PyLate model finetuned from Speedsy/turkish-multilingual-e5-small-32768 on the train dataset. It maps sentences & paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.

Model Details

Model Description

  • Model Type: PyLate model
  • Base model: Speedsy/turkish-multilingual-e5-small-32768
  • Output Dimensionality: 128 dimensions per token
  • Similarity Function: MaxSim

Model Sources

  • Repository: https://github.com/lightonai/pylate

Full Model Architecture

ColBERT(
  (0): Transformer({'max_seq_length': 179, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Dense({'in_features': 384, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
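
The MaxSim operator scores a query against a document by taking, for each query-token embedding, its maximum similarity over all document-token embeddings and summing those maxima. A minimal PyTorch sketch of this scoring, for illustration only (PyLate computes it internally through its retriever and rank utilities):

import torch

def maxsim(query_embeddings: torch.Tensor, document_embeddings: torch.Tensor) -> torch.Tensor:
    # query_embeddings: (num_query_tokens, 128), document_embeddings: (num_doc_tokens, 128)
    # For each query token, keep the best similarity over all document tokens, then sum.
    similarities = query_embeddings @ document_embeddings.T
    return similarities.max(dim=1).values.sum()

# Toy example with random 128-dimensional token embeddings
query = torch.randn(32, 128)
document = torch.randn(180, 128)
print(maxsim(query, document))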

Usage

First install the PyLate library:

pip install -U pylate

Retrieval

PyLate provides a streamlined interface to index and retrieve documents using ColBERT models. The index leverages the Voyager HNSW index to efficiently handle document embeddings and enable fast retrieval.

Indexing documents

First, load the ColBERT model and initialize the Voyager index, then encode and index your documents:

from pylate import indexes, models, retrieve

# Step 1: Load the ColBERT model
pylate_model_id = "Speedsy/turkish-multilingual-e5-small-32768-colbert-cleaned-data-3000"  # model ID on the Hugging Face Hub
model = models.ColBERT(
    model_name_or_path=pylate_model_id,
)

# Step 2: Initialize the Voyager index
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
    override=True,  # This overwrites the existing index if any
)

# Step 3: Encode the documents
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]

documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
    show_progress_bar=True,
)

# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)

Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:

# To load an index, simply instantiate it with the correct folder/name and without overriding it
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
)

Retrieving top-k documents for queries

Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries. To do so, initialize the ColBERT retriever with the index you want to search in, encode the queries, and then retrieve the top-k documents to get the ids and relevance scores of the top matches:

# Step 1: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)

# Step 2: Encode the queries
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,  # Ensure that it is set to True to indicate that these are queries
    show_progress_bar=True,
)

# Step 3: Retrieve top-k documents
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,  # Retrieve the top 10 matches for each query
)
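
The retriever returns one ranked list of matches per query; in the PyLate examples each match is a dictionary with an "id" and a "score". A short sketch of inspecting the results under that assumption:

# Print the top matches for each query
queries = ["query for document 3", "query for document 1"]
for query, matches in zip(queries, scores):
    print(query)
    for match in matches:
        print(f"  id={match['id']} score={match['score']:.4f}")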

Reranking

If you only want to use the ColBERT model to perform reranking on top of your first-stage retrieval pipeline without building an index, you can simply use the rank function and pass the queries and documents to rerank:

from pylate import rank, models

queries = [
    "query A",
    "query B",
]

documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

documents_ids = [
    [1, 2],
    [1, 3, 2],
]

pylate_model_id = "Speedsy/turkish-multilingual-e5-small-32768-colbert-cleaned-data-3000"  # model ID on the Hugging Face Hub
model = models.ColBERT(
    model_name_or_path=pylate_model_id,
)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
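
rank.rerank returns, for each query, its candidate documents re-ordered by MaxSim relevance; as with retrieval, each entry is expected to carry the document id and its score. A short sketch of reading the output under that assumption:

# Print the reranked candidates for each query
for query, reranked in zip(queries, reranked_documents):
    print(query)
    for entry in reranked:
        print(f"  id={entry['id']} score={entry['score']:.4f}")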

Evaluation

Metrics

PyLate Information Retrieval

  • Dataset: ['NanoDBPedia', 'NanoFiQA2018', 'NanoHotpotQA', 'NanoMSMARCO', 'NanoNQ', 'NanoSCIDOCS']
  • Evaluated with pylate.evaluation.pylate_information_retrieval_evaluator.PyLateInformationRetrievalEvaluator
Metric NanoDBPedia NanoFiQA2018 NanoHotpotQA NanoMSMARCO NanoNQ NanoSCIDOCS
MaxSim_accuracy@1 0.82 0.32 0.76 0.36 0.6 0.36
MaxSim_accuracy@3 0.92 0.48 0.94 0.56 0.7 0.52
MaxSim_accuracy@5 0.96 0.54 0.94 0.62 0.74 0.56
MaxSim_accuracy@10 0.96 0.6 0.98 0.72 0.8 0.72
MaxSim_precision@1 0.82 0.32 0.76 0.36 0.6 0.36
MaxSim_precision@3 0.66 0.22 0.4933 0.1867 0.24 0.26
MaxSim_precision@5 0.596 0.164 0.316 0.124 0.152 0.188
MaxSim_precision@10 0.526 0.096 0.172 0.072 0.082 0.15
MaxSim_recall@1 0.1068 0.1872 0.38 0.36 0.57 0.0757
MaxSim_recall@3 0.182 0.3065 0.74 0.56 0.68 0.1597
MaxSim_recall@5 0.255 0.372 0.79 0.62 0.71 0.1917
MaxSim_recall@10 0.3752 0.4196 0.86 0.72 0.74 0.3067
MaxSim_ndcg@10 0.6615 0.3599 0.7818 0.5325 0.6693 0.2927
MaxSim_mrr@10 0.8767 0.4126 0.8462 0.4735 0.6647 0.4673
MaxSim_map@100 0.5096 0.3126 0.7096 0.4837 0.6455 0.2213

PyLate Custom NanoBEIR

  • Dataset: NanoBEIR_mean
  • Evaluated with pylate_nano_beir_evaluator.PylateCustomNanoBEIREvaluator
Metric Value
MaxSim_accuracy@1 0.5367
MaxSim_accuracy@3 0.6867
MaxSim_accuracy@5 0.7267
MaxSim_accuracy@10 0.7967
MaxSim_precision@1 0.5367
MaxSim_precision@3 0.3433
MaxSim_precision@5 0.2567
MaxSim_precision@10 0.183
MaxSim_recall@1 0.2799
MaxSim_recall@3 0.438
MaxSim_recall@5 0.4898
MaxSim_recall@10 0.5702
MaxSim_ndcg@10 0.5496
MaxSim_mrr@10 0.6235
MaxSim_map@100 0.4804

Training Details

Training Dataset

train

  • Dataset: train at 1072b6b
  • Size: 443,147 training samples
  • Columns: query_id, document_ids, and scores
  • Approximate statistics based on the first 1000 samples:
    • query_id: string (min: 5 tokens, mean: 5.83 tokens, max: 6 tokens)
    • document_ids: list (size: 32 elements)
    • scores: list (size: 32 elements)
  • Samples:
    query_id document_ids scores
    817836 ['2716076', '6741935', '2681109', '5562684', '3507339', ...] [1.0, 0.7059561610221863, 0.21702419221401215, 0.38270196318626404, 0.20812414586544037, ...]
    1045170 ['5088671', '2953295', '8783471', '4268439', '6339935', ...] [1.0, 0.6493034362792969, 0.0692221149802208, 0.17963139712810516, 0.6697239875793457, ...]
    1069432 ['3724008', '314949', '8657336', '7420456', '879004', ...] [1.0, 0.3706032931804657, 0.3508036434650421, 0.2823200523853302, 0.17563475668430328, ...]
  • Loss: pylate.losses.distillation.Distillation

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 16
  • learning_rate: 3e-05
  • num_train_epochs: 1
  • bf16: True
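
For reference, these non-default values map onto the Sentence Transformers training arguments roughly as follows; this is a sketch, not the exact training script, and output_dir is illustrative:

from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="output",  # illustrative
    eval_strategy="steps",
    per_device_train_batch_size=16,
    learning_rate=3e-5,
    num_train_epochs=1,
    bf16=True,
)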

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 8
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 3e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss NanoDBPedia_MaxSim_ndcg@10 NanoFiQA2018_MaxSim_ndcg@10 NanoHotpotQA_MaxSim_ndcg@10 NanoMSMARCO_MaxSim_ndcg@10 NanoNQ_MaxSim_ndcg@10 NanoSCIDOCS_MaxSim_ndcg@10 NanoBEIR_mean_MaxSim_ndcg@10
0.0007 20 0.0324 - - - - - - -
0.0014 40 0.0293 - - - - - - -
0.0022 60 0.0296 - - - - - - -
0.0029 80 0.0282 - - - - - - -
0.0036 100 0.0298 - - - - - - -
0.0043 120 0.0281 - - - - - - -
0.0051 140 0.0285 - - - - - - -
0.0058 160 0.0275 - - - - - - -
0.0065 180 0.0289 - - - - - - -
0.0072 200 0.0276 - - - - - - -
0.0079 220 0.0276 - - - - - - -
0.0087 240 0.0269 - - - - - - -
0.0094 260 0.0248 - - - - - - -
0.0101 280 0.0254 - - - - - - -
0.0108 300 0.0248 - - - - - - -
0.0116 320 0.0248 - - - - - - -
0.0123 340 0.0246 - - - - - - -
0.0130 360 0.0257 - - - - - - -
0.0137 380 0.0243 - - - - - - -
0.0144 400 0.025 - - - - - - -
0.0152 420 0.0243 - - - - - - -
0.0159 440 0.0247 - - - - - - -
0.0166 460 0.0261 - - - - - - -
0.0173 480 0.0232 - - - - - - -
0.0181 500 0.0239 0.6474 0.3140 0.7666 0.5267 0.6014 0.2568 0.5188
0.0188 520 0.0251 - - - - - - -
0.0195 540 0.0242 - - - - - - -
0.0202 560 0.0243 - - - - - - -
0.0209 580 0.0238 - - - - - - -
0.0217 600 0.0228 - - - - - - -
0.0224 620 0.0243 - - - - - - -
0.0231 640 0.0228 - - - - - - -
0.0238 660 0.0237 - - - - - - -
0.0246 680 0.0239 - - - - - - -
0.0253 700 0.0238 - - - - - - -
0.0260 720 0.0248 - - - - - - -
0.0267 740 0.0234 - - - - - - -
0.0274 760 0.0242 - - - - - - -
0.0282 780 0.0238 - - - - - - -
0.0289 800 0.0224 - - - - - - -
0.0296 820 0.0237 - - - - - - -
0.0303 840 0.0238 - - - - - - -
0.0311 860 0.0234 - - - - - - -
0.0318 880 0.0238 - - - - - - -
0.0325 900 0.023 - - - - - - -
0.0332 920 0.0239 - - - - - - -
0.0339 940 0.0232 - - - - - - -
0.0347 960 0.0239 - - - - - - -
0.0354 980 0.0239 - - - - - - -
0.0361 1000 0.0241 0.6389 0.3160 0.7573 0.5378 0.5876 0.2993 0.5228
0.0368 1020 0.0234 - - - - - - -
0.0375 1040 0.0229 - - - - - - -
0.0383 1060 0.0236 - - - - - - -
0.0390 1080 0.0232 - - - - - - -
0.0397 1100 0.0236 - - - - - - -
0.0404 1120 0.0236 - - - - - - -
0.0412 1140 0.022 - - - - - - -
0.0419 1160 0.0217 - - - - - - -
0.0426 1180 0.0233 - - - - - - -
0.0433 1200 0.0234 - - - - - - -
0.0440 1220 0.0233 - - - - - - -
0.0448 1240 0.0235 - - - - - - -
0.0455 1260 0.0242 - - - - - - -
0.0462 1280 0.0236 - - - - - - -
0.0469 1300 0.023 - - - - - - -
0.0477 1320 0.0233 - - - - - - -
0.0484 1340 0.0232 - - - - - - -
0.0491 1360 0.0225 - - - - - - -
0.0498 1380 0.0215 - - - - - - -
0.0505 1400 0.0212 - - - - - - -
0.0513 1420 0.0222 - - - - - - -
0.0520 1440 0.0229 - - - - - - -
0.0527 1460 0.0225 - - - - - - -
0.0534 1480 0.0249 - - - - - - -
0.0542 1500 0.0234 0.6643 0.3292 0.7842 0.5483 0.6179 0.2975 0.5402
0.0549 1520 0.0236 - - - - - - -
0.0556 1540 0.021 - - - - - - -
0.0563 1560 0.0226 - - - - - - -
0.0570 1580 0.0236 - - - - - - -
0.0578 1600 0.0208 - - - - - - -
0.0585 1620 0.0216 - - - - - - -
0.0592 1640 0.0231 - - - - - - -
0.0599 1660 0.0225 - - - - - - -
0.0607 1680 0.0219 - - - - - - -
0.0614 1700 0.0213 - - - - - - -
0.0621 1720 0.0223 - - - - - - -
0.0628 1740 0.0234 - - - - - - -
0.0635 1760 0.0217 - - - - - - -
0.0643 1780 0.023 - - - - - - -
0.0650 1800 0.0231 - - - - - - -
0.0657 1820 0.0224 - - - - - - -
0.0664 1840 0.0229 - - - - - - -
0.0672 1860 0.0221 - - - - - - -
0.0679 1880 0.0221 - - - - - - -
0.0686 1900 0.0228 - - - - - - -
0.0693 1920 0.0217 - - - - - - -
0.0700 1940 0.024 - - - - - - -
0.0708 1960 0.0232 - - - - - - -
0.0715 1980 0.023 - - - - - - -
0.0722 2000 0.0232 0.6557 0.3446 0.7881 0.5640 0.6351 0.2824 0.5450
0.0729 2020 0.0229 - - - - - - -
0.0737 2040 0.0221 - - - - - - -
0.0744 2060 0.0221 - - - - - - -
0.0751 2080 0.0222 - - - - - - -
0.0758 2100 0.0223 - - - - - - -
0.0765 2120 0.0237 - - - - - - -
0.0773 2140 0.0227 - - - - - - -
0.0780 2160 0.0233 - - - - - - -
0.0787 2180 0.0228 - - - - - - -
0.0794 2200 0.0213 - - - - - - -
0.0802 2220 0.0222 - - - - - - -
0.0809 2240 0.0231 - - - - - - -
0.0816 2260 0.0225 - - - - - - -
0.0823 2280 0.0234 - - - - - - -
0.0830 2300 0.0222 - - - - - - -
0.0838 2320 0.0225 - - - - - - -
0.0845 2340 0.0224 - - - - - - -
0.0852 2360 0.0217 - - - - - - -
0.0859 2380 0.0217 - - - - - - -
0.0867 2400 0.0228 - - - - - - -
0.0874 2420 0.0228 - - - - - - -
0.0881 2440 0.0229 - - - - - - -
0.0888 2460 0.0223 - - - - - - -
0.0895 2480 0.0215 - - - - - - -
0.0903 2500 0.0224 0.6657 0.3728 0.7859 0.5651 0.6248 0.2813 0.5492
0.0910 2520 0.0221 - - - - - - -
0.0917 2540 0.0213 - - - - - - -
0.0924 2560 0.0226 - - - - - - -
0.0932 2580 0.022 - - - - - - -
0.0939 2600 0.0219 - - - - - - -
0.0946 2620 0.0224 - - - - - - -
0.0953 2640 0.0222 - - - - - - -
0.0960 2660 0.0211 - - - - - - -
0.0968 2680 0.0222 - - - - - - -
0.0975 2700 0.0224 - - - - - - -
0.0982 2720 0.0215 - - - - - - -
0.0989 2740 0.0214 - - - - - - -
0.0996 2760 0.0209 - - - - - - -
0.1004 2780 0.0211 - - - - - - -
0.1011 2800 0.0229 - - - - - - -
0.1018 2820 0.0214 - - - - - - -
0.1025 2840 0.0218 - - - - - - -
0.1033 2860 0.0208 - - - - - - -
0.1040 2880 0.0235 - - - - - - -
0.1047 2900 0.0228 - - - - - - -
0.1054 2920 0.021 - - - - - - -
0.1061 2940 0.0207 - - - - - - -
0.1069 2960 0.023 - - - - - - -
0.1076 2980 0.0213 - - - - - - -
0.1083 3000 0.022 0.6615 0.3599 0.7818 0.5325 0.6693 0.2927 0.5496

Framework Versions

  • Python: 3.11.12
  • Sentence Transformers: 4.0.2
  • PyLate: 1.2.0
  • Transformers: 4.48.2
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.6.0
  • Datasets: 3.6.0
  • Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084"
}

PyLate

@misc{PyLate,
    title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
    author={Chaffin, Antoine and Sourty, Raphaël},
    url={https://github.com/lightonai/pylate},
    year={2024}
}