PyLate model based on jhu-clsp/ettin-encoder-17m

This is a PyLate model finetuned from jhu-clsp/ettin-encoder-17m. It maps sentences & paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.

Model Details

Model Description

  • Model Type: PyLate model
  • Base model: jhu-clsp/ettin-encoder-17m
  • Document Length: 256 tokens
  • Query Length: 32 tokens
  • Output Dimensionality: 128 dimensions
  • Similarity Function: MaxSim
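
To make the similarity function concrete, here is a minimal, dependency-free sketch of MaxSim: each query token vector is matched to its most similar document token vector, and those maxima are summed. The function names and the toy 2-dimensional vectors are illustrative assumptions, not PyLate code; in practice the same computation runs on batched tensors.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_vectors, document_vectors):
    """Sum, over query tokens, of the best similarity to any document token."""
    return sum(
        max(dot(q, d) for d in document_vectors)
        for q in query_vectors
    )


# Toy 2-dimensional token vectors; the model itself emits 128-dimensional ones.
query_vectors = [[1.0, 0.0], [0.0, 1.0]]
document_vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, -1.0]]

# First query token best matches [1.0, 0.0]; second best matches [0.6, 0.8].
score = maxsim(query_vectors, document_vectors)
print(score)
```

Because every query token picks its own best document token, a document can score well even when the matching evidence is spread across different parts of its text.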

Model Sources

Full Model Architecture

ColBERT(
  (0): Transformer({'max_seq_length': 255, 'do_lower_case': False}) with Transformer model: ModernBertModel 
  (1): Dense({'in_features': 256, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
)

Usage

First install the PyLate library:

pip install -U pylate

Retrieval

PyLate provides a streamlined interface to index and retrieve documents using ColBERT models. The index leverages the Voyager HNSW index to efficiently handle document embeddings and enable fast retrieval.

Indexing documents

First, load the ColBERT model and initialize the Voyager index, then encode and index your documents:

from pylate import indexes, models, retrieve

# Step 1: Load the ColBERT model
model = models.ColBERT(
    model_name_or_path="yosefw/colbert-ettin-17m",
)

# Step 2: Initialize the Voyager index
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
    override=True,  # This overwrites the existing index if any
)

# Step 3: Encode the documents
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]

documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
    show_progress_bar=True,
)

# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)

Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:

# To load an index, simply instantiate it with the correct folder/name and without overriding it
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
)

Retrieving top-k documents for queries

Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries. To do so, initialize the ColBERT retriever with the index you want to search, encode the queries, and then retrieve the top-k documents to get the IDs and relevance scores of the best matches:

# Step 1: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)

# Step 2: Encode the queries
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,  # Ensure that it is set to True to indicate that these are queries
    show_progress_bar=True,
)

# Step 3: Retrieve top-k documents
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,  # Retrieve the top 10 matches for each query
)
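
The sketch below shows one way to read the retrieval output. It assumes that `retriever.retrieve` returns one ranked list per query, where each match is a dict with `id` and `score` keys; verify this against the PyLate version you have installed. The ids and scores are invented for illustration and are not results from this model, and `top_ids` is a hypothetical helper.

```python
# Hypothetical output for two queries, in the assumed shape of
# `retriever.retrieve(...)`: one list per query, best match first.
# The ids and scores below are invented for illustration.
scores = [
    [{"id": "3", "score": 11.3}, {"id": "1", "score": 10.5}],
    [{"id": "1", "score": 10.9}, {"id": "2", "score": 9.6}],
]
queries = ["query for document 3", "query for document 1"]


def top_ids(results):
    """Return the document ids for one query, in ranked order."""
    return [match["id"] for match in results]


for query, results in zip(queries, scores):
    print(query, "->", top_ids(results))
```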

Reranking

If you only want to use the ColBERT model to rerank the output of your first-stage retrieval pipeline, without building an index, you can use the rank.rerank function and pass it the queries and documents to rerank:

from pylate import rank, models

queries = [
    "query A",
    "query B",
]

documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

documents_ids = [
    [1, 2],
    [1, 3, 2],
]

model = models.ColBERT(
    model_name_or_path="yosefw/colbert-ettin-17m",
)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
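
Conceptually, reranking scores each candidate document against its query with MaxSim and sorts the candidates by that score. The following is a simplified sketch of that idea for a single query, not PyLate's implementation; the `rerank` helper, the output shape, and the toy vectors are assumptions made for illustration.

```python
def maxsim(query_vectors, document_vectors):
    """Sum, over query tokens, of the best dot product with any document token."""
    return sum(
        max(sum(a * b for a, b in zip(q, d)) for d in document_vectors)
        for q in query_vectors
    )


def rerank(candidate_ids, query_vectors, candidate_vectors):
    """Sort one query's candidates by descending MaxSim score."""
    scored = [
        {"id": doc_id, "score": maxsim(query_vectors, doc_vectors)}
        for doc_id, doc_vectors in zip(candidate_ids, candidate_vectors)
    ]
    return sorted(scored, key=lambda item: item["score"], reverse=True)


query_vectors = [[1.0, 0.0], [0.0, 1.0]]
candidate_ids = [1, 2]
candidate_vectors = [
    [[0.0, 1.0]],              # candidate 1 matches only the second query token
    [[1.0, 0.0], [0.0, 1.0]],  # candidate 2 matches both query tokens
]

ranked = rerank(candidate_ids, query_vectors, candidate_vectors)
print([item["id"] for item in ranked])
```

Since no index is involved, the cost grows with the number of candidates per query, which is why this approach suits a short list produced by a first-stage retriever.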

Evaluation

Metrics

ColBERT Triplet

  • Evaluated with pylate.evaluation.colbert_triplet.ColBERTTripletEvaluator
  • accuracy: 0.798
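
As a rough guide to what this metric measures, the sketch below computes triplet accuracy under the assumption that a triplet counts as correct when the query scores its positive document higher than its negative document. The function name and the scores are invented for illustration; they are not taken from the evaluator's source or from this model's outputs.

```python
def triplet_accuracy(positive_scores, negative_scores):
    """Fraction of triplets whose positive outscores its negative."""
    pairs = list(zip(positive_scores, negative_scores))
    correct = sum(1 for pos, neg in pairs if pos > neg)
    return correct / len(pairs)


# Invented similarity scores for four (query, positive, negative) triplets.
positive_scores = [12.0, 9.5, 7.0, 10.0]
negative_scores = [8.0, 9.9, 6.5, 4.0]

# The second triplet is the only one where the negative wins.
accuracy = triplet_accuracy(positive_scores, negative_scores)
print(accuracy)
```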

Training Details

Training Dataset

Unnamed Dataset

  • Size: 972,246 training samples
  • Columns: query, positive, negative_1, negative_2, and negative_3
  • Approximate statistics based on the first 1000 samples:
    • query (string): min 5 tokens, mean 10.04 tokens, max 22 tokens
    • positive (string): min 21 tokens, mean 31.92 tokens, max 32 tokens
    • negative_1 (string): min 19 tokens, mean 31.92 tokens, max 32 tokens
    • negative_2 (string): min 19 tokens, mean 31.92 tokens, max 32 tokens
    • negative_3 (string): min 22 tokens, mean 31.94 tokens, max 32 tokens
  • Samples:
    query positive negative_1 negative_2 negative_3
    what diseases does sugar cause Eating too much sugar raises your risk for gaining weight and the health problems that are associated with being overweight. You are more likely to suffer diabetes, heart disease, high blood pressure, cancer and many other health conditions when you indulge your sweet tooth too often.Table sugar isn’t the only culprit when it comes to sugar.ou are more likely to suffer diabetes, heart disease, high blood pressure, cancer and many other health conditions when you indulge your sweet tooth too often. Table sugar isn’t the only culprit when it comes to sugar. Sugar and Heart Disease. Lately, we’ve seen epidemiological data suggesting that increased intake of sugar-sweetened beverages increases the risk for metabolic syndrome, type 2 diabetes, coronary heart disease, and stroke (5).ugar and Heart Disease. Lately, we’ve seen epidemiological data suggesting that increased intake of sugar-sweetened beverages increases the risk for metabolic syndrome, type 2 diabetes, coronary heart disease, and stroke (5). High intake of sugar and refined carbohydrates is associated with increased risk of diabetes, metabolic syndrome, non-alcoholic fatty liver disease, lipid disorders and high blood pressure.ugar and Heart Disease. Lately, we’ve seen epidemiological data suggesting that increased intake of sugar-sweetened beverages increases the risk for metabolic syndrome, type 2 diabetes, coronary heart disease, and stroke (5). Sugar Disease is a problem that manifests in different ways in different individuals, of different ages and of different genetic susceptibility-but its three cardinal forms are:he glycemic index: key to diet for Sugar Disease. Some have proposed that persons with variants of Sugar Disease follow a diet that rigidly excludes carbohydrates, concentrating instead on meat and vegetables. In my opinion this is rarely necessary and results in dietary imbalances.
    what diseases does sugar cause Eating too much sugar raises your risk for gaining weight and the health problems that are associated with being overweight. You are more likely to suffer diabetes, heart disease, high blood pressure, cancer and many other health conditions when you indulge your sweet tooth too often.Table sugar isn’t the only culprit when it comes to sugar.ou are more likely to suffer diabetes, heart disease, high blood pressure, cancer and many other health conditions when you indulge your sweet tooth too often. Table sugar isn’t the only culprit when it comes to sugar. Another mechanism whereby sugar consumption may increase the risk of cardiovascular disease is through its effects on blood pressure. It is well known that high blood pressure increases the risk for cardiovascular disease.ugar and Heart Disease. Lately, we’ve seen epidemiological data suggesting that increased intake of sugar-sweetened beverages increases the risk for metabolic syndrome, type 2 diabetes, coronary heart disease, and stroke (5). The term Sugar Disease is a convenient catch-all for a host of modern conditions that result from an unbridled intake of sugar or refined carbohydrates coupled with a sedentary lifestyle.he glycemic index: key to diet for Sugar Disease. Some have proposed that persons with variants of Sugar Disease follow a diet that rigidly excludes carbohydrates, concentrating instead on meat and vegetables. In my opinion this is rarely necessary and results in dietary imbalances. Brown sugar is just sucrose with molasses – same basic composition. Glucose, or blood sugar, is the sugar that circulates in your blood. Fructose, or fruit sugar, is found in plants and honey. 
It’s the fructose in sugar that causes the problem, as you will see.That doesn’t mean you shouldn’t eat whole fruit; whole fruit contains fiber that slows digestion.It does mean that fruit juice poses a danger.t’s not a fact that sugar causes cancer, and Lustig does not become an absolute authority by virtue of his lecture going viral. It’s not the way medical science works, or should work: web virality implies popularity, not truth.
    average cost per square foot to build a house Generally, turnkey costs will start at around $70 a square foot for a starter home in ideal conditions. An average level of finish will be more like $90-100. At $110-120 we are building a custom finish. Home prices of $55, $66, $72, $80, $84, $92, $110, $118, and $328 per square foot combine to produce an average of $112 per square foot, which is probably a reasonable figure for many areas of the country. However, the difference between the lowest figure and the highest is very substantial. An average commercial steel building costs between $16 and $20 per square foot, including building package (I-Beams, purlins, girts etc.) , delivery, foundation and the cost of construction. At first glance, the average home building cost per square foot seems extremely high. People who do a lot of home improvement jobs are usually the first ones to question, because they know the cost of materials. The fact is that the materials are about 25-33 percent of the cost of a house.
  • Loss: pylate.losses.contrastive.Contrastive
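
For intuition about the loss, here is a simplified sketch of a contrastive objective for a single query: a softmax cross-entropy in which the positive document must outscore the negatives. This is an assumption-laden illustration rather than PyLate's implementation, which may differ (for example, by also using other documents in the batch as negatives). The function name and the scores are hypothetical.

```python
import math


def contrastive_loss(positive_score, negative_scores):
    """Cross-entropy of the positive against the negatives for one query."""
    scores = [positive_score] + list(negative_scores)
    max_score = max(scores)  # subtract the max for numerical stability
    log_sum_exp = max_score + math.log(
        sum(math.exp(s - max_score) for s in scores)
    )
    return log_sum_exp - positive_score


# Positive clearly ahead of three negatives: loss close to zero.
loss_easy = contrastive_loss(12.0, [3.0, 2.0, 1.0])

# Positive tied with three negatives: loss equals log(4).
loss_hard = contrastive_loss(5.0, [5.0, 5.0, 5.0])

print(loss_easy, loss_hard)
```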

Evaluation Dataset

Unnamed Dataset

  • Size: 20,000 evaluation samples
  • Columns: query, positive, negative_1, negative_2, and negative_3
  • Approximate statistics based on the first 1000 samples:
    • query (string): min 5 tokens, mean 10.12 tokens, max 20 tokens
    • positive (string): min 28 tokens, mean 31.99 tokens, max 32 tokens
    • negative_1 (string): min 19 tokens, mean 31.9 tokens, max 32 tokens
    • negative_2 (string): min 19 tokens, mean 31.95 tokens, max 32 tokens
    • negative_3 (string): min 19 tokens, mean 31.92 tokens, max 32 tokens
  • Samples:
    query positive negative_1 negative_2 negative_3
    what are the dimensions of a regulation nba backboard? What is the size of a basketball backboard? Per NBA regulations, the dimensions of a basketball backboard are 6 feet wide by 3 ½ feet high. The backboard is marked with a 2-inch white rectangle that is centered behind the ring; the rectangle's outer dimensions are 24 inches wide by 18 inches high. The basket itself consists of a metal ring with an 18-inch inner diameter and a white cord net that is 15 to 18 inches long. The Backboard and Rim: The regulation distance from the ground to the top of the rim is 10 feet for all levels of play. Regulation backboards are 6 feet wide (72 inches) by 42 inches tall. All basketball rims (hoops) are 18 inches in diameter. The inner square on the backboard is 24 inches wide by 18 inches tall. All line markings on the floor are 2 inches wide and can vary in color. 72 Backboard (Regulation) 72 x 42 backboard systems are the best choice if you're a serious basketball enthusiast or a former high school or college competitor. 1 The official, regulation size used in high school, college, and NBA competition. Ideal system for a dedicated backyard court or a large 3 car garage driveway. Each basket ring shall be securely attached to the backboard with its upper edge 10' above and parallel to the floor and equidistant from the vertical edges of the board. The nearest point of the inside edge of the ring shall be 6 from the plane of the face of the board.
    what are the dimensions of a regulation nba backboard? What is the size of a basketball backboard? Per NBA regulations, the dimensions of a basketball backboard are 6 feet wide by 3 ½ feet high. The backboard is marked with a 2-inch white rectangle that is centered behind the ring; the rectangle's outer dimensions are 24 inches wide by 18 inches high. The basket itself consists of a metal ring with an 18-inch inner diameter and a white cord net that is 15 to 18 inches long. 72 Backboard (Regulation) 72 x 42 backboard systems are the best choice if you're a serious basketball enthusiast or a former high school or college competitor. The official, regulation size used in high school, college, and NBA competition. What are the standard measurements for a basketball backboard? The standard size for a basketball backboard is 6 feet wide and 3 1/2 feet high. The surface should be flat and usually there is a 2-inch thick rectangle above the rim that is 24 inches wide and 18 inches high. The areas identified by the lane space markings are 2 by 8 inches and the neutral zone marks are 12 by 8. c. A free throw line shall be drawn (2 wide) across each of the circles indicated in the court diagram. It shall be parallel to the end line and shall be 15' from the plane of the face of the backboard.
    what is a usb receiver The Logitech Unifying receiver is a miniaturised dedicated USB wireless receiver which permits up to 6 devices such as mice and keyboards (headphones are not compatible), which must be made by Logitech and of compatible design, to be linked to the same computer using 2.4 GHz band radio communication in a way very similar to, but incompatible with ... Wireless USB. Wireless USB is a short-range, high-bandwidth wireless radio communication protocol created by the Wireless USB Promoter Group which intends to further increase the availability of general USB-based technologies. It is maintained by the WiMedia Alliance and (as of 2009) the current revision is 1.0, which was approved in 2005. Replacement USB RF receiver for current Air Mouse Elite and Air Mouse... Replacement USB RF receiver for current Air Mouse Elite and Air Mouse GO Plus products. Enables a 100-foot (30-meter) wireless range. Before ordering for GO Plus, confirm that your mouse unit is labeled AS04130. Consider purchasing two to ensure uninterrupted Air Mouse operation. VicTsing MM057 2.4G Wireless Portable Mobile Mouse Optical Mice with USB Receiver, 5 Adjustable DPI Levels, 6... by VicTsing. $ 9 99 $19.99Prime. Get it by Tomorrow, Apr 21. 50% off item with purchase of 1 items. 15% off item with purchase of 1 items. See Details.
  • Loss: pylate.losses.contrastive.Contrastive

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • learning_rate: 8e-06
  • num_train_epochs: 4
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.05
  • fp16: True
  • load_best_model_at_end: True
  • push_to_hub: True
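
The hyperparameters above imply a step count and a learning-rate curve, sketched here. It assumes training on a single device (so the effective batch size is 64) and the common warmup-then-cosine shape: linear warmup to the peak rate, followed by cosine decay to zero. The exact schedule used by the trainer may differ in details such as rounding.

```python
import math

TRAIN_SAMPLES = 972_246
BATCH_SIZE = 64        # per_device_train_batch_size, assuming one device
EPOCHS = 4
PEAK_LR = 8e-06
WARMUP_RATIO = 0.05

steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def learning_rate(step):
    """Linear warmup to PEAK_LR, then cosine decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(steps_per_epoch, total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, learning_rate(step))
```

Under these assumptions, the computed steps per epoch and total steps agree with the step counts reported in the training logs (15192 and 60768).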

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 8e-06
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 4
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.05
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: True
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss accuracy
1.0 15192 1.3895 - -
0 0 - - 0.7882
1.0 15192 - 1.0425 -
2.0 30384 0.9597 - -
0 0 - - 0.7972
2.0 30384 - 1.0071 -
3.0 45576 0.8756 - -
0 0 - - 0.7979
3.0 45576 - 1.0083 -
4.0 60768 0.8355 - -
0 0 - - 0.7978
4.0 60768 - 1.0145 -
0 0 - - 0.7980
  • The bold row denotes the saved checkpoint.

Framework Versions

  • Python: 3.11.11
  • Sentence Transformers: 4.0.2
  • PyLate: 1.2.0
  • Transformers: 4.48.2
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.5.2
  • Datasets: 3.6.0
  • Tokenizers: 0.21.1
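
To compare a local environment with the versions listed above, a small check along these lines may help. The PyPI distribution names used as keys are assumptions (the card lists display names only), and `installed_versions` is a hypothetical helper. A mismatch does not necessarily mean the model will fail to load.

```python
from importlib import metadata

# Versions reported in this card, keyed by assumed PyPI distribution name.
EXPECTED = {
    "sentence-transformers": "4.0.2",
    "pylate": "1.2.0",
    "transformers": "4.48.2",
    "torch": "2.6.0+cu124",
    "accelerate": "1.5.2",
    "datasets": "3.6.0",
    "tokenizers": "0.21.1",
}


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


for name, version in installed_versions(EXPECTED).items():
    status = "ok" if version == EXPECTED[name] else "differs"
    print(f"{name}: installed={version} expected={EXPECTED[name]} ({status})")
```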

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084"
}

PyLate

@misc{PyLate,
title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
author={Chaffin, Antoine and Sourty, Raphaël},
url={https://github.com/lightonai/pylate},
year={2024}
}