PyLate model based on nreimers/MiniLM-L6-H384-uncased

This is a PyLate model fine-tuned from nreimers/MiniLM-L6-H384-uncased. It maps sentences and paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.

Model Details

Model Description

Model Type: PyLate model
Base model: nreimers/MiniLM-L6-H384-uncased
Document Length: 180 tokens
Query Length: 32 tokens
Output Dimensionality: 128 dimensions
Similarity Function: MaxSim
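
MaxSim scores a query against a document by taking, for each query token embedding, the highest similarity it reaches against any document token embedding, and then summing those maxima. The following is a minimal pure-Python sketch of that idea on toy 2-dimensional vectors rather than the model's 128-dimensional ones. The names `dot` and `maxsim` are illustrative and are not part of PyLate.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: List[Vector], document_tokens: List[Vector]) -> float:
    """Sum, over query tokens, of the best dot product against any document token."""
    return sum(
        max(dot(q, d) for d in document_tokens)
        for q in query_tokens
    )


# Toy example: 2 query tokens and 3 document tokens, 2 dimensions each.
query = [[1.0, 0.0], [0.0, 1.0]]
document = [[0.5, 0.5], [1.0, 0.0], [0.0, 0.25]]

score = maxsim(query, document)
print(score)  # 1.0 (best match for token 1) + 0.5 (best match for token 2) = 1.5
```

Because every query token contributes its own best match, a document can score well by covering each part of the query with a different token.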

Model Sources

Documentation: PyLate Documentation
Repository: PyLate on GitHub
Hugging Face: PyLate models on Hugging Face

Full Model Architecture

ColBERT(
  (0): Transformer({'max_seq_length': 179, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Dense({'in_features': 384, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
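
To make the shapes in this listing concrete, the sketch below mimics only the second module: a bias-free linear map with identity activation that takes each 384-dimensional token embedding to 128 dimensions. It uses random stand-in weights and inputs in pure Python, so it illustrates the shape flow and nothing about the learned model. The function name `project` is illustrative.

```python
import random
from typing import List

IN_FEATURES = 384   # in_features reported by the Dense layer
OUT_FEATURES = 128  # out_features reported by the Dense layer


def project(token_embedding: List[float], weight: List[List[float]]) -> List[float]:
    """Bias-free linear map with identity activation: y = W x."""
    return [sum(w * x for w, x in zip(row, token_embedding)) for row in weight]


rng = random.Random(0)

# Random stand-in weights; the real layer uses learned weights.
weight = [
    [rng.uniform(-0.1, 0.1) for _ in range(IN_FEATURES)]
    for _ in range(OUT_FEATURES)
]

# A stand-in "document" of 5 token embeddings, as the transformer would emit them.
tokens = [[rng.uniform(-1.0, 1.0) for _ in range(IN_FEATURES)] for _ in range(5)]
projected = [project(token, weight) for token in tokens]

print(len(projected), len(projected[0]))  # 5 128
```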

Usage

First install the PyLate library:

pip install -U pylate

Retrieval

PyLate provides a streamlined interface to index and retrieve documents using ColBERT models. The index leverages the Voyager HNSW index to efficiently handle document embeddings and enable fast retrieval.

Indexing documents

First, load the ColBERT model and initialize the Voyager index, then encode and index your documents:

from pylate import indexes, models, retrieve

# Step 1: Load the ColBERT model
model = models.ColBERT(
    model_name_or_path="ayushexel/colbert-MiniLM-L6-H384-uncased-3-neg-1-epoch-gooaq-1995000",
)

# Step 2: Initialize the Voyager index
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
    override=True,  # This overwrites the existing index if any
)

# Step 3: Encode the documents
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]

documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
    show_progress_bar=True,
)

# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)

Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:

# To load an index, simply instantiate it with the correct folder/name and without overriding it
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
)

Retrieving top-k documents for queries

Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries. To do so, initialize the ColBERT retriever with the index you want to search, encode the queries, and then retrieve the top-k documents to get the ids and relevance scores of the best matches:

# Step 1: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)

# Step 2: Encode the queries
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,  # Ensure that it is set to True to indicate that these are queries
    show_progress_bar=True,
)

# Step 3: Retrieve top-k documents
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,  # Retrieve the top 10 matches for each query
)
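
The sketch below shows one way to read the retrieval output. It assumes the result is a list with one entry per query, each entry being a list of dictionaries with `id` and `score` keys, which is how the PyLate documentation presents it. The `scores` value here is a made-up stand-in and not real model output, and `best_match_ids` is a hypothetical helper.

```python
from typing import Dict, List, Union

Match = Dict[str, Union[str, float]]

# Made-up stand-in for the value returned by retriever.retrieve(...);
# the scores are illustrative, not real model output.
scores: List[List[Match]] = [
    [{"id": "3", "score": 11.5}, {"id": "1", "score": 9.0}],
    [{"id": "1", "score": 10.25}, {"id": "2", "score": 8.5}],
]
queries = ["query for document 3", "query for document 1"]


def best_match_ids(results: List[List[Match]]) -> List[str]:
    """Return the id of the highest-scoring document for each query."""
    return [
        str(max(matches, key=lambda match: match["score"])["id"])
        for matches in results
    ]


for query, top_id in zip(queries, best_match_ids(scores)):
    print(f"{query!r} -> document {top_id}")
```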

Reranking

If you only want to use the ColBERT model to rerank the results of your first-stage retrieval pipeline, without building an index, you can use the rank.rerank function and pass it the queries and documents to rerank:

from pylate import rank, models

queries = [
    "query A",
    "query B",
]

documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

documents_ids = [
    [1, 2],
    [1, 3, 2],
]

model = models.ColBERT(
    model_name_or_path="ayushexel/colbert-MiniLM-L6-H384-uncased-3-neg-1-epoch-gooaq-1995000",
)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
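
Candidate ids in this example are scoped per query, so mapping reranked ids back to text needs one lookup table per query. The sketch below assumes the reranked output has the same shape as the retrieval output (a list per query of dictionaries with `id` and `score` keys). The `reranked_documents` value is a made-up stand-in and not real model output, and `attach_text` is a hypothetical helper.

```python
from typing import Dict, List, Tuple, Union

documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]
documents_ids = [
    [1, 2],
    [1, 3, 2],
]

# Made-up stand-in for the output of rank.rerank(...); scores are illustrative.
reranked_documents: List[List[Dict[str, Union[int, float]]]] = [
    [{"id": 2, "score": 7.5}, {"id": 1, "score": 6.0}],
    [{"id": 3, "score": 9.0}, {"id": 1, "score": 4.5}, {"id": 2, "score": 3.0}],
]


def attach_text(reranked, ids, texts) -> List[List[Tuple[str, float]]]:
    """Pair each reranked id with its text, using one lookup table per query."""
    output = []
    for matches, query_ids, query_texts in zip(reranked, ids, texts):
        text_by_id = dict(zip(query_ids, query_texts))
        output.append(
            [(text_by_id[match["id"]], float(match["score"])) for match in matches]
        )
    return output


for ranking in attach_text(reranked_documents, documents_ids, documents):
    print(ranking)
```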

Evaluation

Metrics

ColBERTTriplet

Evaluated with pylate.evaluation.colbert_triplet.ColBERTTripletEvaluator

Metric	Value
accuracy	0.4774
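
As a rough illustration of what a triplet accuracy measures, the sketch below counts the fraction of (question, answer, negative) triplets in which the answer scores strictly higher than the negative. This is an assumption about the metric's general definition, not a copy of the evaluator's code. The score pairs are made up, and `triplet_accuracy` is a hypothetical helper.

```python
from typing import List, Tuple


def triplet_accuracy(score_pairs: List[Tuple[float, float]]) -> float:
    """Fraction of (positive_score, negative_score) pairs where the positive wins."""
    if not score_pairs:
        raise ValueError("need at least one triplet")
    wins = sum(1 for positive, negative in score_pairs if positive > negative)
    return wins / len(score_pairs)


# Made-up (answer score, negative score) pairs, one per triplet.
pairs = [(12.0, 9.5), (8.0, 8.5), (10.0, 7.0), (6.5, 6.5)]

accuracy = triplet_accuracy(pairs)
print(accuracy)  # 2 of 4 positives strictly win -> 0.5
```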

Training Details

Training Dataset

Unnamed Dataset

Size: 5,679,484 training samples
Columns: question, answer, and negative

Approximate statistics based on the first 1000 samples:

	question	answer	negative
type	string	string	string
details	min: 9 tokens mean: 12.97 tokens max: 21 tokens	min: 20 tokens mean: 31.84 tokens max: 32 tokens	min: 16 tokens mean: 31.64 tokens max: 32 tokens

Samples:

question	answer	negative
`can i use bluetooth headphones for xbox one?`	`Headsets cannot be connected to any third party wireless controller. Headsets need to be connected to the Xbox one controller in order to function. The Xbox one console doesn't have a Bluetooth feature. Hence the headsets cannot be connected via Bluetooth.`	`You can connect Bluetooth headphones to a PS4, but only if they are compatible with the PS4. Most standard Bluetooth headphones are not compatible with the PS4, so you will need to make sure you have Bluetooth headphones that are specifically geared to the PS4.`
`can i use bluetooth headphones for xbox one?`	`Headsets cannot be connected to any third party wireless controller. Headsets need to be connected to the Xbox one controller in order to function. The Xbox one console doesn't have a Bluetooth feature. Hence the headsets cannot be connected via Bluetooth.`	`Summary – how to pair Sony Bluetooth headphones Tap and hold the Power button on the headphones for 7 seconds to put your Sony Bluetooth headphones into pairing mode. Tap the Settings icon on your iPhone. Select the Bluetooth option. Select your headphones from the list of devices, then wait for it to say “Connected.”`
`can i use bluetooth headphones for xbox one?`	`Headsets cannot be connected to any third party wireless controller. Headsets need to be connected to the Xbox one controller in order to function. The Xbox one console doesn't have a Bluetooth feature. Hence the headsets cannot be connected via Bluetooth.`	`You can only pair one Bluetooth headphone or soundbar and one other Bluetooth device to the TV at the same time, but not two Bluetooth headphones or soundbars at the same time.`

Loss: pylate.losses.contrastive.Contrastive
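
A contrastive loss of this kind is commonly computed by treating the query's scores against its positive and its negatives as logits and applying cross-entropy with the positive as the target. The sketch below illustrates that general idea for a single query. It is not PyLate's implementation, which may differ in details such as in-batch negatives or scaling, and `contrastive_loss` is an illustrative name.

```python
import math
from typing import List


def contrastive_loss(positive_score: float, negative_scores: List[float]) -> float:
    """Cross-entropy with the positive as the target: -log softmax(positive)."""
    logits = [positive_score] + list(negative_scores)
    peak = max(logits)  # subtract the max before exponentiating, for stability
    log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in logits))
    return log_sum_exp - positive_score


# The loss is small when the positive clearly outscores the negative ...
print(contrastive_loss(10.0, [2.0]))
# ... and equals log(2) when positive and negative are tied.
print(contrastive_loss(5.0, [5.0]))
```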

Evaluation Dataset

Unnamed Dataset

Size: 5,000 evaluation samples
Columns: question, answer, and negative_1

Approximate statistics based on the first 1000 samples:

	question	answer	negative_1
type	string	string	string
details	min: 9 tokens mean: 12.83 tokens max: 23 tokens	min: 13 tokens mean: 31.71 tokens max: 32 tokens	min: 11 tokens mean: 31.37 tokens max: 32 tokens

Samples:

question	answer	negative_1
`what is controlled by the peripheral nervous system?`	`The efferent nerves of the somatic nervous system of the PNS is responsible for voluntary, conscious control of skeletal muscles (effector organ) using motor (efferent) nerves. The efferent nerves of the autonomic (visceral) nervous system control the visceral functions of the body.`	`Which of the following is not a part of peripheral nervous system? Explanation: Peripheral nervous system lies outside the brain and spinal cord. Spinal cord is not a part of peripheral nervous system.`
`is cold water good to drink in the morning?`	`This is probably because drinking cold water makes it easier for your body to maintain a lower core temperature. Drinking plain water, no matter the temperature, has been proven to give your body more energy throughout the day.`	`What Are Benefits of It Cold? Drinking water cold is beneficial because it tastes better and you are more likely to drink more of it. Cold lemon water tastes delicious and so you are more likely to drink more of it.`
`how to get rid of fungal nail quickly?`	`According to a 2016 review, thymol has antifungal and antibacterial properties. To treat toenail fungus, apply oregano oil to the affected nail twice daily with a cotton swab. Some people use oregano oil and tea tree oil together.`	`With treatment, many people can get rid of nail fungus. Even when the fungus clears, your nail(s) may look unhealthy until the infected nail grows out. A fingernail grows out in 4 to 6 months and a toenail in 12 to 18 months.`

Loss: pylate.losses.contrastive.Contrastive

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: steps
per_device_train_batch_size: 128
per_device_eval_batch_size: 128
learning_rate: 3e-06
num_train_epochs: 1
warmup_ratio: 0.1
seed: 12
bf16: True
dataloader_num_workers: 12
load_best_model_at_end: True
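
The sketch below works out the step count and learning-rate schedule these settings imply. It assumes a single device with no gradient accumulation, which is consistent with `gradient_accumulation_steps: 1` and with the epoch fractions in the training logs, and it assumes the usual linear schedule with warmup (linear ramp from 0 to the peak, then linear decay to 0). The function name `learning_rate_at` is illustrative.

```python
import math

TRAIN_SAMPLES = 5_679_484   # size of the training dataset
BATCH_SIZE = 128            # per_device_train_batch_size, assuming one device
LEARNING_RATE = 3e-06       # peak learning rate
WARMUP_RATIO = 0.1

# One epoch; the last partial batch counts because dataloader_drop_last is False.
total_steps = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def learning_rate_at(step: int) -> float:
    """Linear warmup from 0 to the peak, then linear decay back to 0."""
    if step < warmup_steps:
        return LEARNING_RATE * step / warmup_steps
    remaining = max(0, total_steps - step)
    return LEARNING_RATE * remaining / (total_steps - warmup_steps)


print(total_steps, warmup_steps)  # 44371 4438
print(learning_rate_at(200))      # still early in warmup
print(learning_rate_at(20000))    # in the decay phase
```

Under these assumptions, step 200 corresponds to 200 / 44371 ≈ 0.0045 of an epoch, which matches the first logged entry after step 1.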

All Hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 128
per_device_eval_batch_size: 128
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 3e-06
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 1
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 12
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: True
fp16: False
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 12
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: True
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: batch_sampler
multi_dataset_batch_sampler: proportional

Training Logs

Epoch	Step	Training Loss	Validation Loss	accuracy
0	0	-	-	0.3568
0.0000	1	10.0066	-	-
0.0045	200	9.562	-	-
0.0090	400	8.6749	-	-
0.0135	600	6.7475	-	-
0.0180	800	4.9203	-	-
0.0225	1000	3.4444	-	-
0.0270	1200	2.5604	-	-
0.0316	1400	2.1878	-	-
0.0361	1600	1.9166	-	-
0.0406	1800	1.7376	-	-
0.0451	2000	1.5786	-	-
0.0496	2200	1.4304	-	-
0.0541	2400	1.3307	-	-
0.0586	2600	1.2409	-	-
0.0631	2800	1.1913	-	-
0.0676	3000	1.0885	-	-
0.0721	3200	1.0439	-	-
0.0766	3400	0.9721	-	-
0.0811	3600	0.918	-	-
0.0856	3800	0.8688	-	-
0.0901	4000	0.8269	-	-
0.0947	4200	0.7815	-	-
0.0992	4400	0.7577	-	-
0.1037	4600	0.714	-	-
0.1082	4800	0.6923	-	-
0.1127	5000	0.6619	-	-
0.1172	5200	0.6409	-	-
0.1217	5400	0.6142	-	-
0.1262	5600	0.6163	-	-
0.1307	5800	0.5821	-	-
0.1352	6000	0.5822	-	-
0.1397	6200	0.5572	-	-
0.1442	6400	0.555	-	-
0.1487	6600	0.5392	-	-
0.1533	6800	0.5326	-	-
0.1578	7000	0.5185	-	-
0.1623	7200	0.507	-	-
0.1668	7400	0.4943	-	-
0.1713	7600	0.4915	-	-
0.1758	7800	0.4951	-	-
0.1803	8000	0.4806	-	-
0.1848	8200	0.4782	-	-
0.1893	8400	0.4719	-	-
0.1938	8600	0.4628	-	-
0.1983	8800	0.4615	-	-
0.2028	9000	0.4624	-	-
0.2073	9200	0.4462	-	-
0.2119	9400	0.4571	-	-
0.2164	9600	0.452	-	-
0.2209	9800	0.4454	-	-
0.2254	10000	0.4387	-	-
0.2299	10200	0.4247	-	-
0.2344	10400	0.4221	-	-
0.2389	10600	0.4242	-	-
0.2434	10800	0.422	-	-
0.2479	11000	0.4252	-	-
0.2524	11200	0.416	-	-
0.2569	11400	0.4138	-	-
0.2614	11600	0.4139	-	-
0.2659	11800	0.4168	-	-
0.2704	12000	0.4008	-	-
0.2750	12200	0.3994	-	-
0.2795	12400	0.3973	-	-
0.2840	12600	0.393	-	-
0.2885	12800	0.3863	-	-
0.2930	13000	0.3914	-	-
0.2975	13200	0.38	-	-
0.3020	13400	0.3805	-	-
0.3065	13600	0.3749	-	-
0.3110	13800	0.3814	-	-
0.3155	14000	0.3783	-	-
0.3200	14200	0.3733	-	-
0.3245	14400	0.3762	-	-
0.3290	14600	0.3797	-	-
0.3336	14800	0.3727	-	-
0.3381	15000	0.3658	-	-
0.3426	15200	0.3655	-	-
0.3471	15400	0.3619	-	-
0.3516	15600	0.3685	-	-
0.3561	15800	0.3608	-	-
0.3606	16000	0.3631	-	-
0.3651	16200	0.3587	-	-
0.3696	16400	0.3536	-	-
0.3741	16600	0.3477	-	-
0.3786	16800	0.3595	-	-
0.3831	17000	0.3558	-	-
0.3876	17200	0.3518	-	-
0.3921	17400	0.353	-	-
0.3967	17600	0.354	-	-
0.4012	17800	0.3477	-	-
0.4057	18000	0.3457	-	-
0.4102	18200	0.346	-	-
0.4147	18400	0.3451	-	-
0.4192	18600	0.3437	-	-
0.4237	18800	0.3401	-	-
0.4282	19000	0.342	-	-
0.4327	19200	0.3416	-	-
0.4372	19400	0.3405	-	-
0.4417	19600	0.3331	-	-
0.4462	19800	0.3319	-	-
0.4507	20000	0.3264	-	-
0	0	-	-	0.4590
0.4507	20000	-	1.2902	-
0.4553	20200	0.3312	-	-
0.4598	20400	0.3363	-	-
0.4643	20600	0.333	-	-
0.4688	20800	0.3341	-	-
0.4733	21000	0.3287	-	-
0.4778	21200	0.3357	-	-
0.4823	21400	0.3325	-	-
0.4868	21600	0.3323	-	-
0.4913	21800	0.3385	-	-
0.4958	22000	0.3244	-	-
0.5003	22200	0.3281	-	-
0.5048	22400	0.3251	-	-
0.5093	22600	0.3271	-	-
0.5138	22800	0.3271	-	-
0.5184	23000	0.3245	-	-
0.5229	23200	0.3185	-	-
0.5274	23400	0.3212	-	-
0.5319	23600	0.3211	-	-
0.5364	23800	0.3205	-	-
0.5409	24000	0.3104	-	-
0.5454	24200	0.3208	-	-
0.5499	24400	0.3218	-	-
0.5544	24600	0.3183	-	-
0.5589	24800	0.3208	-	-
0.5634	25000	0.3151	-	-
0.5679	25200	0.3138	-	-
0.5724	25400	0.3155	-	-
0.5770	25600	0.3201	-	-
0.5815	25800	0.3135	-	-
0.5860	26000	0.3157	-	-
0.5905	26200	0.3051	-	-
0.5950	26400	0.3121	-	-
0.5995	26600	0.3109	-	-
0.6040	26800	0.3103	-	-
0.6085	27000	0.316	-	-
0.6130	27200	0.3119	-	-
0.6175	27400	0.3135	-	-
0.6220	27600	0.3007	-	-
0.6265	27800	0.304	-	-
0.6310	28000	0.3014	-	-
0.6356	28200	0.3075	-	-
0.6401	28400	0.3074	-	-
0.6446	28600	0.3072	-	-
0.6491	28800	0.3043	-	-
0.6536	29000	0.3059	-	-
0.6581	29200	0.3054	-	-
0.6626	29400	0.3019	-	-
0.6671	29600	0.3108	-	-
0.6716	29800	0.3032	-	-
0.6761	30000	0.3054	-	-
0.6806	30200	0.3034	-	-
0.6851	30400	0.3008	-	-
0.6896	30600	0.3	-	-
0.6941	30800	0.3042	-	-
0.6987	31000	0.3018	-	-
0.7032	31200	0.3162	-	-
0.7077	31400	0.2998	-	-
0.7122	31600	0.2975	-	-
0.7167	31800	0.3015	-	-
0.7212	32000	0.3005	-	-
0.7257	32200	0.3028	-	-
0.7302	32400	0.3029	-	-
0.7347	32600	0.2968	-	-
0.7392	32800	0.3066	-	-
0.7437	33000	0.2958	-	-
0.7482	33200	0.2968	-	-
0.7527	33400	0.2963	-	-
0.7573	33600	0.3026	-	-
0.7618	33800	0.2891	-	-
0.7663	34000	0.2991	-	-
0.7708	34200	0.2939	-	-
0.7753	34400	0.2923	-	-
0.7798	34600	0.295	-	-
0.7843	34800	0.2901	-	-
0.7888	35000	0.294	-	-
0.7933	35200	0.2945	-	-
0.7978	35400	0.299	-	-
0.8023	35600	0.297	-	-
0.8068	35800	0.2881	-	-
0.8113	36000	0.298	-	-
0.8158	36200	0.2925	-	-
0.8204	36400	0.2978	-	-
0.8249	36600	0.2989	-	-
0.8294	36800	0.2914	-	-
0.8339	37000	0.2913	-	-
0.8384	37200	0.2925	-	-
0.8429	37400	0.2991	-	-
0.8474	37600	0.291	-	-
0.8519	37800	0.2937	-	-
0.8564	38000	0.2989	-	-
0.8609	38200	0.2854	-	-
0.8654	38400	0.2878	-	-
0.8699	38600	0.2905	-	-
0.8744	38800	0.287	-	-
0.8790	39000	0.2869	-	-
0.8835	39200	0.2927	-	-
0.8880	39400	0.2889	-	-
0.8925	39600	0.2912	-	-
0.8970	39800	0.2927	-	-
0.9015	40000	0.2952	-	-
0	0	-	-	0.4774
0.9015	40000	-	1.2227	-
0.9060	40200	0.29	-	-
0.9105	40400	0.2878	-	-
0.9150	40600	0.2924	-	-
0.9195	40800	0.2877	-	-
0.9240	41000	0.2844	-	-
0.9285	41200	0.2951	-	-
0.9330	41400	0.291	-	-
0.9375	41600	0.292	-	-
0.9421	41800	0.2902	-	-
0.9466	42000	0.2815	-	-
0.9511	42200	0.29	-	-
0.9556	42400	0.2872	-	-
0.9601	42600	0.2759	-	-
0.9646	42800	0.2832	-	-
0.9691	43000	0.2886	-	-
0.9736	43200	0.2908	-	-
0.9781	43400	0.2857	-	-
0.9826	43600	0.2833	-	-
0.9871	43800	0.2837	-	-
0.9916	44000	0.2882	-	-
0.9961	44200	0.2919	-	-

The bold row denotes the saved checkpoint.

Framework Versions

Python: 3.11.0
Sentence Transformers: 4.0.1
PyLate: 1.1.7
Transformers: 4.48.2
PyTorch: 2.6.0+cu124
Accelerate: 1.6.0
Datasets: 3.5.0
Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084"
}

PyLate

@misc{PyLate,
title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
author={Chaffin, Antoine and Sourty, Raphaël},
url={https://github.com/lightonai/pylate},
year={2024}
}
