PyLate model based on microsoft/Multilingual-MiniLM-L12-H384

This is a PyLate model finetuned from microsoft/Multilingual-MiniLM-L12-H384 on the train dataset. It maps sentences & paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.
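
For reference, the MaxSim operator scores a query/document pair by matching each query token embedding to its most similar document token embedding and summing those maxima. A minimal illustrative sketch in PyTorch (not PyLate's internal implementation):

import torch

def maxsim(query_embeddings: torch.Tensor, document_embeddings: torch.Tensor) -> torch.Tensor:
    # query_embeddings: (num_query_tokens, 128), document_embeddings: (num_document_tokens, 128)
    similarity = query_embeddings @ document_embeddings.T  # pairwise token-level similarities
    return similarity.max(dim=1).values.sum()  # best document token per query token, summed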

Model Details

Model Description

  • Model Type: PyLate model
  • Base model: microsoft/Multilingual-MiniLM-L12-H384
  • Document Length: 180 tokens
  • Query Length: 32 tokens
  • Output Dimensionality: 128 dimensions
  • Similarity Function: MaxSim
  • Training Dataset: train
  • Language: en

Model Sources

  • Repository: https://github.com/lightonai/pylate

Full Model Architecture

ColBERT(
  (0): Transformer({'max_seq_length': 179, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Dense({'in_features': 384, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
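
That is, token embeddings from the 384-dimensional MiniLM encoder are projected down to 128 dimensions by a bias-free linear layer with an identity (no-op) activation. An equivalent standalone projection in PyTorch, for illustration only:

import torch
import torch.nn as nn

projection = nn.Linear(in_features=384, out_features=128, bias=False)
token_embeddings = torch.randn(180, 384)  # e.g. encoder outputs for one 180-token document
projected = projection(token_embeddings)  # (180, 128) token vectors consumed by MaxSim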

Usage

First install the PyLate library:

pip install -U pylate

Retrieval

PyLate provides a streamlined interface to index and retrieve documents using ColBERT models. Indexing is backed by a Voyager HNSW index, which stores the document embeddings and enables fast retrieval.

Indexing documents

First, load the ColBERT model and initialize the Voyager index, then encode and index your documents:

from pylate import indexes, models, retrieve

# Step 1: Load the ColBERT model
model = models.ColBERT(
    model_name_or_path="Speedsy/pylate-49500",  # this model's Hugging Face id
)

# Step 2: Initialize the Voyager index
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
    override=True,  # This overwrites the existing index if any
)

# Step 3: Encode the documents
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]

documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
    show_progress_bar=True,
)

# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)

Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:

# To load an index, simply instantiate it with the correct folder/name and without overriding it
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
)

Retrieving top-k documents for queries

Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries. To do so, initialize the ColBERT retriever with the index you want to search in, encode the queries, and then retrieve the top-k documents to get the ids and relevance scores of the top matches:

# Step 1: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)

# Step 2: Encode the queries
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,  # Ensure that it is set to True to indicate that these are queries
    show_progress_bar=True,
)

# Step 3: Retrieve top-k documents
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,  # Retrieve the top 10 matches for each query
)
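
The retriever returns one ranked list per query. A quick way to inspect the matches, continuing the snippet above (the "id" and "score" field names are assumptions; check the PyLate documentation for the exact output schema):

for query_index, matches in enumerate(scores):
    for match in matches:
        print(query_index, match["id"], match["score"])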

Reranking

If you only want to use the ColBERT model to perform reranking on top of your first-stage retrieval pipeline without building an index, you can simply use the rank.rerank function and pass the queries and documents to rerank:

from pylate import rank, models

queries = [
    "query A",
    "query B",
]

documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

documents_ids = [
    [1, 2],
    [1, 3, 2],
]

model = models.ColBERT(
    model_name_or_path="Speedsy/pylate-49500",  # this model's Hugging Face id
)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
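
The output mirrors the input structure: one list per query, reordered by relevance. For instance, to print the best match per query (again assuming "id" and "score" fields, as in the retrieval output):

for query, ranking in zip(queries, reranked_documents):
    best = ranking[0]  # highest-scoring document after reranking
    print(query, "->", best["id"], best["score"])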

Training Details

Training Dataset

train

  • Dataset: train at b9b0f7f
  • Size: 798,036 training samples
  • Columns: query_id, document_ids, and scores
  • Approximate statistics based on the first 1000 samples:

             query_id                                          document_ids       scores
    type     string                                            list               list
    details  min: 4 tokens, mean: 5.82 tokens, max: 6 tokens   size: 32 elements  size: 32 elements
  • Samples:
    query_id document_ids scores
    817836 ['2716076', '6741935', '2681109', '5562684', '3507339', ...] [1.0, 0.7059561610221863, 0.21702419221401215, 0.38270196318626404, 0.20812414586544037, ...]
    1045170 ['5088671', '2953295', '8783471', '4268439', '6339935', ...] [1.0, 0.6493034362792969, 0.0692221149802208, 0.17963139712810516, 0.6697239875793457, ...]
    1154488 ['6498614', '3770829', '1060712', '2590533', '7672044', ...] [0.9497447609901428, 0.6662212610244751, 0.7423420548439026, 1.0, 0.6580896973609924, ...]
  • Loss: pylate.losses.distillation.Distillation (sketched below)
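
The distillation objective trains the student model's MaxSim scores to mimic the teacher scores stored in the scores column. A minimal sketch of the standard KL-divergence formulation of score distillation (an assumption about the loss family; see the PyLate source for the exact implementation):

import torch
import torch.nn.functional as F

def distillation_loss(student_scores: torch.Tensor, teacher_scores: torch.Tensor) -> torch.Tensor:
    # Both tensors have shape (batch_size, num_documents): one score per query-document pair.
    return F.kl_div(
        F.log_softmax(student_scores, dim=-1),  # student distribution in log space
        F.softmax(teacher_scores, dim=-1),      # teacher target distribution
        reduction="batchmean",
    )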

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 16
  • learning_rate: 3e-05
  • num_train_epochs: 1
  • fp16: True
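
For context, these non-default values map onto the sentence-transformers training arguments that PyLate's training builds on; a hedged sketch (output_dir is a hypothetical path):

from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="pylate-output",       # hypothetical output path
    per_device_train_batch_size=16,
    learning_rate=3e-5,
    num_train_epochs=1,
    fp16=True,                        # mixed-precision training
)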

All Hyperparameters

Click to expand
  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 8
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 3e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss
0.0100 500 0.0305
0.0200 1000 0.027
0.0301 1500 0.026
0.0401 2000 0.0253
0.0501 2500 0.0249
0.0601 3000 0.0239
0.0702 3500 0.0239
0.0802 4000 0.0236
0.0902 4500 0.0236
0.1002 5000 0.0232
0.1103 5500 0.0229
0.1203 6000 0.0228
0.1303 6500 0.0226
0.1403 7000 0.0226
0.1504 7500 0.0222
0.1604 8000 0.0223
0.1704 8500 0.0218
0.1804 9000 0.0219
0.1905 9500 0.0221
0.2005 10000 0.0219
0.2105 10500 0.0216
0.2205 11000 0.0211
0.2306 11500 0.0214
0.2406 12000 0.0211
0.2506 12500 0.0214
0.2606 13000 0.0212
0.2707 13500 0.0209
0.2807 14000 0.0206
0.2907 14500 0.0208
0.3007 15000 0.0204
0.3108 15500 0.0207
0.3208 16000 0.0204
0.3308 16500 0.0204
0.3408 17000 0.0204
0.3509 17500 0.0203
0.3609 18000 0.0199
0.3709 18500 0.02
0.3809 19000 0.0199
0.3910 19500 0.0197
0.4010 20000 0.0197
0.4110 20500 0.0199
0.4210 21000 0.0198
0.4311 21500 0.0196
0.4411 22000 0.02
0.4511 22500 0.0196
0.4611 23000 0.0196
0.4711 23500 0.0192
0.4812 24000 0.0195
0.4912 24500 0.0197
0.5012 25000 0.0194
0.5112 25500 0.0191
0.5213 26000 0.019
0.5313 26500 0.019
0.5413 27000 0.0192
0.5513 27500 0.0193
0.5614 28000 0.0189
0.5714 28500 0.019
0.5814 29000 0.0189
0.5914 29500 0.0189
0.6015 30000 0.0189
0.6115 30500 0.0187
0.6215 31000 0.0187
0.6315 31500 0.0186
0.6416 32000 0.0187
0.6516 32500 0.0187
0.6616 33000 0.0186
0.6716 33500 0.0185
0.6817 34000 0.0188
0.6917 34500 0.0186
0.7017 35000 0.0185
0.7117 35500 0.0182
0.7218 36000 0.0184
0.7318 36500 0.0183
0.7418 37000 0.0184
0.7518 37500 0.0183
0.7619 38000 0.0182
0.7719 38500 0.0184
0.7819 39000 0.0183
0.7919 39500 0.0184
0.8020 40000 0.0184
0.8120 40500 0.0182
0.8220 41000 0.0182
0.8320 41500 0.0182
0.8421 42000 0.0183
0.8521 42500 0.0183
0.8621 43000 0.0179
0.8721 43500 0.0178
0.8822 44000 0.0178
0.8922 44500 0.018
0.9022 45000 0.0181
0.9122 45500 0.0178
0.9223 46000 0.018
0.9323 46500 0.018
0.9423 47000 0.0178
0.9523 47500 0.0178
0.9623 48000 0.0179
0.9724 48500 0.018
0.9824 49000 0.0181
0.9924 49500 0.018

Framework Versions

  • Python: 3.11.11
  • Sentence Transformers: 3.4.1
  • PyLate: 1.1.7
  • Transformers: 4.48.2
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.5.2
  • Datasets: 3.5.0
  • Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084"
}

PyLate

@misc{PyLate,
    title = {PyLate: Flexible Training and Retrieval for Late Interaction Models},
    author = {Chaffin, Antoine and Sourty, Raphaël},
    url = {https://github.com/lightonai/pylate},
    year = {2024}
}