PyLate model based on nreimers/MiniLM-L6-H384-uncased

This is a PyLate model finetuned from nreimers/MiniLM-L6-H384-uncased. It maps sentences & paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.

Model Details

Model Description

  • Model Type: PyLate model
  • Base model: nreimers/MiniLM-L6-H384-uncased
  • Document Length: 180 tokens
  • Query Length: 32 tokens
  • Output Dimensionality: 128 dimensions
  • Similarity Function: MaxSim
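
To make the MaxSim similarity concrete, here is a minimal pure-Python sketch: for each query token embedding, take the highest dot product with any document token embedding, then sum those maxima. The helper name maxsim and the toy 2-dimensional vectors are illustrative assumptions, not part of PyLate; the real model emits 128-dimensional vectors.

```python
def maxsim(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best dot product with any document token."""
    total = 0.0
    for q in query_vecs:
        # Best-matching document token for this query token.
        best = max(sum(qi * di for qi, di in zip(q, d)) for d in doc_vecs)
        total += best
    return total


# Toy 2-dimensional token embeddings (hypothetical values for illustration).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_b = [[0.5, 0.5]]

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
print(score_a, score_b)
```

Here doc_a scores higher than doc_b because each query token finds an exact match among its tokens.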

Model Sources

Full Model Architecture

ColBERT(
  (0): Transformer({'max_seq_length': 179, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Dense({'in_features': 384, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
)

Usage

First install the PyLate library:

pip install -U pylate

Retrieval

PyLate provides a streamlined interface to index and retrieve documents using ColBERT models. The index is built on the Voyager HNSW index, which stores the document embeddings efficiently and enables fast retrieval.

Indexing documents

First, load the ColBERT model and initialize the Voyager index, then encode and index your documents:

from pylate import indexes, models, retrieve

# Step 1: Load the ColBERT model
model = models.ColBERT(
    model_name_or_path="ayushexel/colbert-MiniLM-L6-H384-uncased-3-neg-1-epoch-gooaq-1995000",
)

# Step 2: Initialize the Voyager index
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
    override=True,  # This overwrites the existing index if any
)

# Step 3: Encode the documents
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]

documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
    show_progress_bar=True,
)

# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)

Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:

# To load an index, simply instantiate it with the correct folder/name and without overriding it
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
)

Retrieving top-k documents for queries

Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries. To do so, initialize the ColBERT retriever with the index you want to search, encode the queries, and then retrieve the top-k documents to get the ids and relevance scores of the best matches:

# Step 1: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)

# Step 2: Encode the queries
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,  # Ensure that it is set to True to indicate that these are queries
    show_progress_bar=True,
)

# Step 3: Retrieve top-k documents
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,  # Retrieve the top 10 matches for each query
)
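
The retrieved results can then be mapped back to the document texts. The sketch below assumes that retrieve returns one list per query, each containing dictionaries with "id" and "score" keys sorted by decreasing score; that shape and the score values shown are assumptions for illustration, so check the output of your installed PyLate version.

```python
# Hypothetical result, assuming one list of {"id", "score"} dicts per query.
scores = [
    [{"id": "3", "score": 11.2}, {"id": "1", "score": 9.8}],
    [{"id": "1", "score": 12.5}, {"id": "2", "score": 7.1}],
]

# Same ids and texts as in the indexing example.
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]
queries = ["query for document 3", "query for document 1"]

# Lookup table from document id to document text.
id_to_text = dict(zip(documents_ids, documents))

top_hits = []
for query, hits in zip(queries, scores):
    best = hits[0]  # Highest-scoring match for this query.
    top_hits.append((query, best["id"], id_to_text[best["id"]]))
    print(f"{query!r} -> id {best['id']} (score {best['score']:.2f})")
```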

Reranking

If you only want to use the ColBERT model to rerank the results of your first-stage retrieval pipeline without building an index, you can simply use the rank.rerank function and pass it the queries and documents to rerank:

from pylate import rank, models

queries = [
    "query A",
    "query B",
]

documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

documents_ids = [
    [1, 2],
    [1, 3, 2],
]

model = models.ColBERT(
    model_name_or_path="ayushexel/colbert-MiniLM-L6-H384-uncased-3-neg-1-epoch-gooaq-1995000",
)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)

Evaluation

Metrics

ColBERT Triplet

  • Evaluated with pylate.evaluation.colbert_triplet.ColBERTTripletEvaluator

  Metric     Value
  accuracy   0.4774
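
As a rough illustration of what a triplet accuracy measures, the sketch below assumes it is the fraction of (question, answer, negative) triplets in which the answer receives a higher score than the negative. The function name triplet_accuracy and the score values are hypothetical; this is not the evaluator's actual code.

```python
def triplet_accuracy(positive_scores, negative_scores):
    """Fraction of triplets whose positive document outscores its negative."""
    if len(positive_scores) != len(negative_scores):
        raise ValueError("Both score lists must have the same length.")
    correct = sum(p > n for p, n in zip(positive_scores, negative_scores))
    return correct / len(positive_scores)


# Hypothetical MaxSim scores for four triplets.
pos = [12.0, 9.5, 7.0, 10.0]
neg = [11.0, 9.9, 6.0, 10.5]

acc = triplet_accuracy(pos, neg)
print(acc)
```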

Training Details

Training Dataset

Unnamed Dataset

  • Size: 5,679,484 training samples
  • Columns: question, answer, and negative
  • Approximate statistics based on the first 1000 samples:
    • question (string): min 9 tokens, mean 12.97 tokens, max 21 tokens
    • answer (string): min 20 tokens, mean 31.84 tokens, max 32 tokens
    • negative (string): min 16 tokens, mean 31.64 tokens, max 32 tokens
  • Samples:
    • Sample 1
      question: can i use bluetooth headphones for xbox one?
      answer: Headsets cannot be connected to any third party wireless controller. Headsets need to be connected to the Xbox one controller in order to function. The Xbox one console doesn't have a Bluetooth feature. Hence the headsets cannot be connected via Bluetooth.
      negative: You can connect Bluetooth headphones to a PS4, but only if they are compatible with the PS4. Most standard Bluetooth headphones are not compatible with the PS4, so you will need to make sure you have Bluetooth headphones that are specifically geared to the PS4.
    • Sample 2
      question: can i use bluetooth headphones for xbox one?
      answer: Headsets cannot be connected to any third party wireless controller. Headsets need to be connected to the Xbox one controller in order to function. The Xbox one console doesn't have a Bluetooth feature. Hence the headsets cannot be connected via Bluetooth.
      negative: Summary – how to pair Sony Bluetooth headphones Tap and hold the Power button on the headphones for 7 seconds to put your Sony Bluetooth headphones into pairing mode. Tap the Settings icon on your iPhone. Select the Bluetooth option. Select your headphones from the list of devices, then wait for it to say “Connected.”
    • Sample 3
      question: can i use bluetooth headphones for xbox one?
      answer: Headsets cannot be connected to any third party wireless controller. Headsets need to be connected to the Xbox one controller in order to function. The Xbox one console doesn't have a Bluetooth feature. Hence the headsets cannot be connected via Bluetooth.
      negative: You can only pair one Bluetooth headphone or soundbar and one other Bluetooth device to the TV at the same time, but not two Bluetooth headphones or soundbars at the same time.
  • Loss: pylate.losses.contrastive.Contrastive
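
For intuition about the contrastive objective, here is a minimal sketch that assumes the loss is a softmax cross-entropy over the similarity scores between a question and its candidate documents, with the answer as the target. The function name contrastive_loss and the score values are hypothetical, and the actual pylate.losses.contrastive.Contrastive implementation may differ in details such as how negatives are gathered.

```python
import math


def contrastive_loss(scores, positive_index=0):
    """Cross-entropy of a softmax over candidate scores, targeting the positive."""
    # Numerically stable log-sum-exp.
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[positive_index]


# The positive (index 0) clearly outscores both negatives: loss is near zero.
easy = contrastive_loss([12.0, 3.0, 2.0])

# The positive is indistinguishable from the negatives: loss equals log(3).
hard = contrastive_loss([5.0, 5.0, 5.0])

print(easy, hard)
```

Under this assumption, the loss shrinks as the answer's score pulls ahead of the negatives' scores.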

Evaluation Dataset

Unnamed Dataset

  • Size: 5,000 evaluation samples
  • Columns: question, answer, and negative_1
  • Approximate statistics based on the first 1000 samples:
    • question (string): min 9 tokens, mean 12.83 tokens, max 23 tokens
    • answer (string): min 13 tokens, mean 31.71 tokens, max 32 tokens
    • negative_1 (string): min 11 tokens, mean 31.37 tokens, max 32 tokens
  • Samples:
    • Sample 1
      question: what is controlled by the peripheral nervous system?
      answer: The efferent nerves of the somatic nervous system of the PNS is responsible for voluntary, conscious control of skeletal muscles (effector organ) using motor (efferent) nerves. The efferent nerves of the autonomic (visceral) nervous system control the visceral functions of the body.
      negative_1: Which of the following is not a part of peripheral nervous system? Explanation: Peripheral nervous system lies outside the brain and spinal cord. Spinal cord is not a part of peripheral nervous system.
    • Sample 2
      question: is cold water good to drink in the morning?
      answer: This is probably because drinking cold water makes it easier for your body to maintain a lower core temperature. Drinking plain water, no matter the temperature, has been proven to give your body more energy throughout the day.
      negative_1: What Are Benefits of It Cold? Drinking water cold is beneficial because it tastes better and you are more likely to drink more of it. Cold lemon water tastes delicious and so you are more likely to drink more of it.
    • Sample 3
      question: how to get rid of fungal nail quickly?
      answer: According to a 2016 review, thymol has antifungal and antibacterial properties. To treat toenail fungus, apply oregano oil to the affected nail twice daily with a cotton swab. Some people use oregano oil and tea tree oil together.
      negative_1: With treatment, many people can get rid of nail fungus. Even when the fungus clears, your nail(s) may look unhealthy until the infected nail grows out. A fingernail grows out in 4 to 6 months and a toenail in 12 to 18 months.
  • Loss: pylate.losses.contrastive.Contrastive

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • learning_rate: 3e-06
  • num_train_epochs: 1
  • warmup_ratio: 0.1
  • seed: 12
  • bf16: True
  • dataloader_num_workers: 12
  • load_best_model_at_end: True
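
The batch size, warmup ratio, and learning rate above can be turned into a step count and a learning-rate curve. The sketch below assumes a single training device, no gradient accumulation, and the linear scheduler listed in the full hyperparameters; the function name linear_lr is illustrative and is not a Transformers API.

```python
import math

TRAIN_SAMPLES = 5_679_484  # Training set size reported in this card.
BATCH_SIZE = 128           # per_device_train_batch_size (single device assumed).
PEAK_LR = 3e-06            # learning_rate
WARMUP_RATIO = 0.1         # warmup_ratio

# One epoch, so total optimizer steps equal the number of batches.
total_steps = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def linear_lr(step):
    """Linear warmup to PEAK_LR, then linear decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    return PEAK_LR * max(0, total_steps - step) / (total_steps - warmup_steps)


print(total_steps, warmup_steps)
print(linear_lr(200), linear_lr(warmup_steps), linear_lr(total_steps))
```

Under these assumptions one epoch is 44,371 steps, which agrees with the training logs, where step 200 corresponds to epoch 0.0045.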

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 3e-06
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 12
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 12
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss accuracy
0 0 - - 0.3568
0.0000 1 10.0066 - -
0.0045 200 9.562 - -
0.0090 400 8.6749 - -
0.0135 600 6.7475 - -
0.0180 800 4.9203 - -
0.0225 1000 3.4444 - -
0.0270 1200 2.5604 - -
0.0316 1400 2.1878 - -
0.0361 1600 1.9166 - -
0.0406 1800 1.7376 - -
0.0451 2000 1.5786 - -
0.0496 2200 1.4304 - -
0.0541 2400 1.3307 - -
0.0586 2600 1.2409 - -
0.0631 2800 1.1913 - -
0.0676 3000 1.0885 - -
0.0721 3200 1.0439 - -
0.0766 3400 0.9721 - -
0.0811 3600 0.918 - -
0.0856 3800 0.8688 - -
0.0901 4000 0.8269 - -
0.0947 4200 0.7815 - -
0.0992 4400 0.7577 - -
0.1037 4600 0.714 - -
0.1082 4800 0.6923 - -
0.1127 5000 0.6619 - -
0.1172 5200 0.6409 - -
0.1217 5400 0.6142 - -
0.1262 5600 0.6163 - -
0.1307 5800 0.5821 - -
0.1352 6000 0.5822 - -
0.1397 6200 0.5572 - -
0.1442 6400 0.555 - -
0.1487 6600 0.5392 - -
0.1533 6800 0.5326 - -
0.1578 7000 0.5185 - -
0.1623 7200 0.507 - -
0.1668 7400 0.4943 - -
0.1713 7600 0.4915 - -
0.1758 7800 0.4951 - -
0.1803 8000 0.4806 - -
0.1848 8200 0.4782 - -
0.1893 8400 0.4719 - -
0.1938 8600 0.4628 - -
0.1983 8800 0.4615 - -
0.2028 9000 0.4624 - -
0.2073 9200 0.4462 - -
0.2119 9400 0.4571 - -
0.2164 9600 0.452 - -
0.2209 9800 0.4454 - -
0.2254 10000 0.4387 - -
0.2299 10200 0.4247 - -
0.2344 10400 0.4221 - -
0.2389 10600 0.4242 - -
0.2434 10800 0.422 - -
0.2479 11000 0.4252 - -
0.2524 11200 0.416 - -
0.2569 11400 0.4138 - -
0.2614 11600 0.4139 - -
0.2659 11800 0.4168 - -
0.2704 12000 0.4008 - -
0.2750 12200 0.3994 - -
0.2795 12400 0.3973 - -
0.2840 12600 0.393 - -
0.2885 12800 0.3863 - -
0.2930 13000 0.3914 - -
0.2975 13200 0.38 - -
0.3020 13400 0.3805 - -
0.3065 13600 0.3749 - -
0.3110 13800 0.3814 - -
0.3155 14000 0.3783 - -
0.3200 14200 0.3733 - -
0.3245 14400 0.3762 - -
0.3290 14600 0.3797 - -
0.3336 14800 0.3727 - -
0.3381 15000 0.3658 - -
0.3426 15200 0.3655 - -
0.3471 15400 0.3619 - -
0.3516 15600 0.3685 - -
0.3561 15800 0.3608 - -
0.3606 16000 0.3631 - -
0.3651 16200 0.3587 - -
0.3696 16400 0.3536 - -
0.3741 16600 0.3477 - -
0.3786 16800 0.3595 - -
0.3831 17000 0.3558 - -
0.3876 17200 0.3518 - -
0.3921 17400 0.353 - -
0.3967 17600 0.354 - -
0.4012 17800 0.3477 - -
0.4057 18000 0.3457 - -
0.4102 18200 0.346 - -
0.4147 18400 0.3451 - -
0.4192 18600 0.3437 - -
0.4237 18800 0.3401 - -
0.4282 19000 0.342 - -
0.4327 19200 0.3416 - -
0.4372 19400 0.3405 - -
0.4417 19600 0.3331 - -
0.4462 19800 0.3319 - -
0.4507 20000 0.3264 - -
0 0 - - 0.4590
0.4507 20000 - 1.2902 -
0.4553 20200 0.3312 - -
0.4598 20400 0.3363 - -
0.4643 20600 0.333 - -
0.4688 20800 0.3341 - -
0.4733 21000 0.3287 - -
0.4778 21200 0.3357 - -
0.4823 21400 0.3325 - -
0.4868 21600 0.3323 - -
0.4913 21800 0.3385 - -
0.4958 22000 0.3244 - -
0.5003 22200 0.3281 - -
0.5048 22400 0.3251 - -
0.5093 22600 0.3271 - -
0.5138 22800 0.3271 - -
0.5184 23000 0.3245 - -
0.5229 23200 0.3185 - -
0.5274 23400 0.3212 - -
0.5319 23600 0.3211 - -
0.5364 23800 0.3205 - -
0.5409 24000 0.3104 - -
0.5454 24200 0.3208 - -
0.5499 24400 0.3218 - -
0.5544 24600 0.3183 - -
0.5589 24800 0.3208 - -
0.5634 25000 0.3151 - -
0.5679 25200 0.3138 - -
0.5724 25400 0.3155 - -
0.5770 25600 0.3201 - -
0.5815 25800 0.3135 - -
0.5860 26000 0.3157 - -
0.5905 26200 0.3051 - -
0.5950 26400 0.3121 - -
0.5995 26600 0.3109 - -
0.6040 26800 0.3103 - -
0.6085 27000 0.316 - -
0.6130 27200 0.3119 - -
0.6175 27400 0.3135 - -
0.6220 27600 0.3007 - -
0.6265 27800 0.304 - -
0.6310 28000 0.3014 - -
0.6356 28200 0.3075 - -
0.6401 28400 0.3074 - -
0.6446 28600 0.3072 - -
0.6491 28800 0.3043 - -
0.6536 29000 0.3059 - -
0.6581 29200 0.3054 - -
0.6626 29400 0.3019 - -
0.6671 29600 0.3108 - -
0.6716 29800 0.3032 - -
0.6761 30000 0.3054 - -
0.6806 30200 0.3034 - -
0.6851 30400 0.3008 - -
0.6896 30600 0.3 - -
0.6941 30800 0.3042 - -
0.6987 31000 0.3018 - -
0.7032 31200 0.3162 - -
0.7077 31400 0.2998 - -
0.7122 31600 0.2975 - -
0.7167 31800 0.3015 - -
0.7212 32000 0.3005 - -
0.7257 32200 0.3028 - -
0.7302 32400 0.3029 - -
0.7347 32600 0.2968 - -
0.7392 32800 0.3066 - -
0.7437 33000 0.2958 - -
0.7482 33200 0.2968 - -
0.7527 33400 0.2963 - -
0.7573 33600 0.3026 - -
0.7618 33800 0.2891 - -
0.7663 34000 0.2991 - -
0.7708 34200 0.2939 - -
0.7753 34400 0.2923 - -
0.7798 34600 0.295 - -
0.7843 34800 0.2901 - -
0.7888 35000 0.294 - -
0.7933 35200 0.2945 - -
0.7978 35400 0.299 - -
0.8023 35600 0.297 - -
0.8068 35800 0.2881 - -
0.8113 36000 0.298 - -
0.8158 36200 0.2925 - -
0.8204 36400 0.2978 - -
0.8249 36600 0.2989 - -
0.8294 36800 0.2914 - -
0.8339 37000 0.2913 - -
0.8384 37200 0.2925 - -
0.8429 37400 0.2991 - -
0.8474 37600 0.291 - -
0.8519 37800 0.2937 - -
0.8564 38000 0.2989 - -
0.8609 38200 0.2854 - -
0.8654 38400 0.2878 - -
0.8699 38600 0.2905 - -
0.8744 38800 0.287 - -
0.8790 39000 0.2869 - -
0.8835 39200 0.2927 - -
0.8880 39400 0.2889 - -
0.8925 39600 0.2912 - -
0.8970 39800 0.2927 - -
0.9015 40000 0.2952 - -
0 0 - - 0.4774
0.9015 40000 - 1.2227 -
0.9060 40200 0.29 - -
0.9105 40400 0.2878 - -
0.9150 40600 0.2924 - -
0.9195 40800 0.2877 - -
0.9240 41000 0.2844 - -
0.9285 41200 0.2951 - -
0.9330 41400 0.291 - -
0.9375 41600 0.292 - -
0.9421 41800 0.2902 - -
0.9466 42000 0.2815 - -
0.9511 42200 0.29 - -
0.9556 42400 0.2872 - -
0.9601 42600 0.2759 - -
0.9646 42800 0.2832 - -
0.9691 43000 0.2886 - -
0.9736 43200 0.2908 - -
0.9781 43400 0.2857 - -
0.9826 43600 0.2833 - -
0.9871 43800 0.2837 - -
0.9916 44000 0.2882 - -
0.9961 44200 0.2919 - -
  • The bold row denotes the saved checkpoint.

Framework Versions

  • Python: 3.11.0
  • Sentence Transformers: 4.0.1
  • PyLate: 1.1.7
  • Transformers: 4.48.2
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.6.0
  • Datasets: 3.5.0
  • Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084"
}

PyLate

@misc{PyLate,
title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
author={Chaffin, Antoine and Sourty, Raphaël},
url={https://github.com/lightonai/pylate},
year={2024}
}

Safetensors

  • Model size: 22.7M params
  • Tensor type: F32