PyLate model based on jhu-clsp/ettin-encoder-17m
This is a PyLate model finetuned from jhu-clsp/ettin-encoder-17m. It maps sentences & paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.
Model Details
Model Description
- Model Type: PyLate model
- Base model: jhu-clsp/ettin-encoder-17m
- Document Length: 256 tokens
- Query Length: 32 tokens
- Output Dimensionality: 128 dimensions
- Similarity Function: MaxSim
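The MaxSim operator named above can be illustrated with a minimal pure-Python sketch. It assumes dot-product similarity between token embeddings and uses toy 2-dimensional vectors; the helper names `dot` and `maxsim` are illustrative and are not part of the PyLate API.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_embeddings, document_embeddings):
    """For each query token, take its best-matching document token, then sum."""
    return sum(
        max(dot(q, d) for d in document_embeddings)
        for q in query_embeddings
    )


# Toy 2-dimensional token embeddings (the model itself emits 128-dimensional vectors).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # matches both query tokens exactly
doc_b = [[0.5, 0.5]]                          # matches each query token only partially

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
print(score_a, score_b)
```

Because every query token contributes its own maximum, a document scores highly only when it covers all parts of the query.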
Model Sources
- Documentation: PyLate Documentation
- Repository: PyLate on GitHub
- Hugging Face: PyLate models on Hugging Face
Full Model Architecture
ColBERT(
(0): Transformer({'max_seq_length': 255, 'do_lower_case': False}) with Transformer model: ModernBertModel
(1): Dense({'in_features': 256, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
Usage
First install the PyLate library:
pip install -U pylate
Retrieval
PyLate provides a streamlined interface to index and retrieve documents using ColBERT models. The index leverages the Voyager HNSW index to efficiently handle document embeddings and enable fast retrieval.
Indexing documents
First, load the ColBERT model and initialize the Voyager index, then encode and index your documents:
from pylate import indexes, models, retrieve
# Step 1: Load the ColBERT model
model = models.ColBERT(
model_name_or_path="yosefw/colbert-ettin-17m",
)
# Step 2: Initialize the Voyager index
index = indexes.Voyager(
index_folder="pylate-index",
index_name="index",
override=True, # This overwrites the existing index if any
)
# Step 3: Encode the documents
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]
documents_embeddings = model.encode(
documents,
batch_size=32,
is_query=False, # Ensure that it is set to False to indicate that these are documents, not queries
show_progress_bar=True,
)
# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
index.add_documents(
documents_ids=documents_ids,
documents_embeddings=documents_embeddings,
)
Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:
# To load an index, simply instantiate it with the correct folder/name and without overriding it
index = indexes.Voyager(
index_folder="pylate-index",
index_name="index",
)
Retrieving top-k documents for queries
Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries. To do so, initialize the ColBERT retriever with the index you want to search, encode the queries, and then retrieve the top-k documents to get the IDs of the best matches and their relevance scores:
# Step 1: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)
# Step 2: Encode the queries
queries_embeddings = model.encode(
["query for document 3", "query for document 1"],
batch_size=32,
is_query=True, # Ensure that it is set to True to indicate that these are queries
show_progress_bar=True,
)
# Step 3: Retrieve top-k documents
scores = retriever.retrieve(
queries_embeddings=queries_embeddings,
k=10, # Retrieve the top 10 matches for each query
)
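As a rough sketch of how the retrieval output might be consumed, the snippet below assumes the result is a list with one entry per query, each entry being a list of matches carrying an "id" and a "score" in descending score order. The values are made up for illustration, and the helper `top_ids` is hypothetical rather than part of PyLate.

```python
# Hypothetical retrieval output: one list of matches per query.
scores = [
    [{"id": "3", "score": 11.2}, {"id": "1", "score": 10.3}],
    [{"id": "1", "score": 10.8}, {"id": "2", "score": 9.1}],
]


def top_ids(results, k=1):
    """Keep only the ids of the first k matches for every query."""
    return [[match["id"] for match in matches[:k]] for matches in results]


best = top_ids(scores)
print(best)
```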
Reranking
If you only want to use the ColBERT model to rerank the results of your first-stage retrieval pipeline without building an index, you can use the rank.rerank function and pass it the queries and documents to rerank:
from pylate import rank, models
queries = [
"query A",
"query B",
]
documents = [
["document A", "document B"],
["document 1", "document C", "document B"],
]
documents_ids = [
[1, 2],
[1, 3, 2],
]
model = models.ColBERT(
model_name_or_path="yosefw/colbert-ettin-17m",
)
queries_embeddings = model.encode(
queries,
is_query=True,
)
documents_embeddings = model.encode(
documents,
is_query=False,
)
reranked_documents = rank.rerank(
documents_ids=documents_ids,
queries_embeddings=queries_embeddings,
documents_embeddings=documents_embeddings,
)
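Conceptually, reranking scores every candidate document against its query and sorts the candidates by that score. The sketch below shows only the sorting step for a single query, using made-up scores; the function `rerank_by_score` and its output layout are assumptions for illustration, not the actual implementation of rank.rerank.

```python
def rerank_by_score(candidate_ids, candidate_scores):
    """Order one query's candidates from highest to lowest score."""
    ranked = sorted(
        zip(candidate_ids, candidate_scores),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [{"id": doc_id, "score": score} for doc_id, score in ranked]


# Candidate ids for the second query in the example above, with made-up scores.
reranked = rerank_by_score([1, 3, 2], [4.0, 7.5, 6.0])
print(reranked)
```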
Evaluation
Metrics
ColBERT Triplet
- Evaluated with pylate.evaluation.colbert_triplet.ColBERTTripletEvaluator
Metric | Value |
---|---|
accuracy | 0.798 |
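One plausible reading of this accuracy figure is the fraction of triplets in which the positive document scores higher than the negative document for the same query. The sketch below computes that quantity from made-up scores; it is an assumption about what the metric measures, not the evaluator's actual code.

```python
def triplet_accuracy(positive_scores, negative_scores):
    """Share of triplets where the positive outscores the negative."""
    correct = sum(p > n for p, n in zip(positive_scores, negative_scores))
    return correct / len(positive_scores)


# Four made-up triplets: the second one is ranked incorrectly.
accuracy = triplet_accuracy([12.0, 9.5, 8.0, 11.0], [10.0, 9.9, 7.0, 6.5])
print(accuracy)
```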
Training Details
Training Dataset
Unnamed Dataset
- Size: 972,246 training samples
- Columns: query, positive, negative_1, negative_2, and negative_3
- Approximate statistics based on the first 1000 samples:
Statistic | query | positive | negative_1 | negative_2 | negative_3 |
---|---|---|---|---|---|
type | string | string | string | string | string |
min | 5 tokens | 21 tokens | 19 tokens | 19 tokens | 22 tokens |
mean | 10.04 tokens | 31.92 tokens | 31.92 tokens | 31.92 tokens | 31.94 tokens |
max | 22 tokens | 32 tokens | 32 tokens | 32 tokens | 32 tokens |
- Samples (each sample lists, in order, query, positive, negative_1, negative_2, and negative_3):
what diseases does sugar cause
Eating too much sugar raises your risk for gaining weight and the health problems that are associated with being overweight. You are more likely to suffer diabetes, heart disease, high blood pressure, cancer and many other health conditions when you indulge your sweet tooth too often.Table sugar isn’t the only culprit when it comes to sugar.ou are more likely to suffer diabetes, heart disease, high blood pressure, cancer and many other health conditions when you indulge your sweet tooth too often. Table sugar isn’t the only culprit when it comes to sugar.
Sugar and Heart Disease. Lately, we’ve seen epidemiological data suggesting that increased intake of sugar-sweetened beverages increases the risk for metabolic syndrome, type 2 diabetes, coronary heart disease, and stroke (5).ugar and Heart Disease. Lately, we’ve seen epidemiological data suggesting that increased intake of sugar-sweetened beverages increases the risk for metabolic syndrome, type 2 diabetes, coronary heart disease, and stroke (5).
High intake of sugar and refined carbohydrates is associated with increased risk of diabetes, metabolic syndrome, non-alcoholic fatty liver disease, lipid disorders and high blood pressure.ugar and Heart Disease. Lately, we’ve seen epidemiological data suggesting that increased intake of sugar-sweetened beverages increases the risk for metabolic syndrome, type 2 diabetes, coronary heart disease, and stroke (5).
Sugar Disease is a problem that manifests in different ways in different individuals, of different ages and of different genetic susceptibility-but its three cardinal forms are:he glycemic index: key to diet for Sugar Disease. Some have proposed that persons with variants of Sugar Disease follow a diet that rigidly excludes carbohydrates, concentrating instead on meat and vegetables. In my opinion this is rarely necessary and results in dietary imbalances.
what diseases does sugar cause
Eating too much sugar raises your risk for gaining weight and the health problems that are associated with being overweight. You are more likely to suffer diabetes, heart disease, high blood pressure, cancer and many other health conditions when you indulge your sweet tooth too often.Table sugar isn’t the only culprit when it comes to sugar.ou are more likely to suffer diabetes, heart disease, high blood pressure, cancer and many other health conditions when you indulge your sweet tooth too often. Table sugar isn’t the only culprit when it comes to sugar.
Another mechanism whereby sugar consumption may increase the risk of cardiovascular disease is through its effects on blood pressure. It is well known that high blood pressure increases the risk for cardiovascular disease.ugar and Heart Disease. Lately, we’ve seen epidemiological data suggesting that increased intake of sugar-sweetened beverages increases the risk for metabolic syndrome, type 2 diabetes, coronary heart disease, and stroke (5).
The term Sugar Disease is a convenient catch-all for a host of modern conditions that result from an unbridled intake of sugar or refined carbohydrates coupled with a sedentary lifestyle.he glycemic index: key to diet for Sugar Disease. Some have proposed that persons with variants of Sugar Disease follow a diet that rigidly excludes carbohydrates, concentrating instead on meat and vegetables. In my opinion this is rarely necessary and results in dietary imbalances.
Brown sugar is just sucrose with molasses – same basic composition. Glucose, or blood sugar, is the sugar that circulates in your blood. Fructose, or fruit sugar, is found in plants and honey. It’s the fructose in sugar that causes the problem, as you will see.That doesn’t mean you shouldn’t eat whole fruit; whole fruit contains fiber that slows digestion.It does mean that fruit juice poses a danger.t’s not a fact that sugar causes cancer, and Lustig does not become an absolute authority by virtue of his lecture going viral. It’s not the way medical science works, or should work: web virality implies popularity, not truth.
average cost per square foot to build a house
Generally, turnkey costs will start at around $70 a square foot for a starter home in ideal conditions. An average level of finish will be more like $90-100. At $110-120 we are building a custom finish.
Home prices of $55, $66, $72, $80, $84, $92, $110, $118, and $328 per square foot combine to produce an average of $112 per square foot, which is probably a reasonable figure for many areas of the country. However, the difference between the lowest figure and the highest is very substantial.
An average commercial steel building costs between $16 and $20 per square foot, including building package (I-Beams, purlins, girts etc.) , delivery, foundation and the cost of construction.
At first glance, the average home building cost per square foot seems extremely high. People who do a lot of home improvement jobs are usually the first ones to question, because they know the cost of materials. The fact is that the materials are about 25-33 percent of the cost of a house.
- Loss: pylate.losses.contrastive.Contrastive
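A contrastive loss over one positive and several negatives is commonly written as softmax cross-entropy on the query-document scores, with the positive as the target. The sketch below follows that common formulation with made-up MaxSim scores; it is an assumption about the general technique and does not reproduce PyLate's implementation, which may for example also use other documents in the batch as negatives.

```python
import math


def contrastive_loss(scores):
    """Cross-entropy over scores, with the positive document at index 0."""
    largest = max(scores)  # subtract the maximum for numerical stability
    log_sum_exp = largest + math.log(sum(math.exp(s - largest) for s in scores))
    return log_sum_exp - scores[0]


# Scores ordered as [positive, negative_1, negative_2, negative_3].
loss_equal = contrastive_loss([1.0, 1.0, 1.0, 1.0])  # model cannot tell them apart
loss_easy = contrastive_loss([10.0, 0.0, 0.0, 0.0])  # positive clearly ahead
print(loss_equal, loss_easy)
```

The loss shrinks toward zero as the positive's score pulls away from the negatives, and equals log(4) when all four documents score the same.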
Evaluation Dataset
Unnamed Dataset
- Size: 20,000 evaluation samples
- Columns: query, positive, negative_1, negative_2, and negative_3
- Approximate statistics based on the first 1000 samples:
Statistic | query | positive | negative_1 | negative_2 | negative_3 |
---|---|---|---|---|---|
type | string | string | string | string | string |
min | 5 tokens | 28 tokens | 19 tokens | 19 tokens | 19 tokens |
mean | 10.12 tokens | 31.99 tokens | 31.9 tokens | 31.95 tokens | 31.92 tokens |
max | 20 tokens | 32 tokens | 32 tokens | 32 tokens | 32 tokens |
- Samples (each sample lists, in order, query, positive, negative_1, negative_2, and negative_3):
what are the dimensions of a regulation nba backboard?
What is the size of a basketball backboard? Per NBA regulations, the dimensions of a basketball backboard are 6 feet wide by 3 ½ feet high. The backboard is marked with a 2-inch white rectangle that is centered behind the ring; the rectangle's outer dimensions are 24 inches wide by 18 inches high. The basket itself consists of a metal ring with an 18-inch inner diameter and a white cord net that is 15 to 18 inches long.
The Backboard and Rim: The regulation distance from the ground to the top of the rim is 10 feet for all levels of play. Regulation backboards are 6 feet wide (72 inches) by 42 inches tall. All basketball rims (hoops) are 18 inches in diameter. The inner square on the backboard is 24 inches wide by 18 inches tall. All line markings on the floor are 2 inches wide and can vary in color.
72 Backboard (Regulation) 72 x 42 backboard systems are the best choice if you're a serious basketball enthusiast or a former high school or college competitor. 1 The official, regulation size used in high school, college, and NBA competition. Ideal system for a dedicated backyard court or a large 3 car garage driveway.
Each basket ring shall be securely attached to the backboard with its upper edge 10' above and parallel to the floor and equidistant from the vertical edges of the board. The nearest point of the inside edge of the ring shall be 6 from the plane of the face of the board.
what are the dimensions of a regulation nba backboard?
What is the size of a basketball backboard? Per NBA regulations, the dimensions of a basketball backboard are 6 feet wide by 3 ½ feet high. The backboard is marked with a 2-inch white rectangle that is centered behind the ring; the rectangle's outer dimensions are 24 inches wide by 18 inches high. The basket itself consists of a metal ring with an 18-inch inner diameter and a white cord net that is 15 to 18 inches long.
72 Backboard (Regulation) 72 x 42 backboard systems are the best choice if you're a serious basketball enthusiast or a former high school or college competitor. The official, regulation size used in high school, college, and NBA competition.
What are the standard measurements for a basketball backboard? The standard size for a basketball backboard is 6 feet wide and 3 1/2 feet high. The surface should be flat and usually there is a 2-inch thick rectangle above the rim that is 24 inches wide and 18 inches high.
The areas identified by the lane space markings are 2 by 8 inches and the neutral zone marks are 12 by 8. c. A free throw line shall be drawn (2 wide) across each of the circles indicated in the court diagram. It shall be parallel to the end line and shall be 15' from the plane of the face of the backboard.
what is a usb receiver
The Logitech Unifying receiver is a miniaturised dedicated USB wireless receiver which permits up to 6 devices such as mice and keyboards (headphones are not compatible), which must be made by Logitech and of compatible design, to be linked to the same computer using 2.4 GHz band radio communication in a way very similar to, but incompatible with ...
Wireless USB. Wireless USB is a short-range, high-bandwidth wireless radio communication protocol created by the Wireless USB Promoter Group which intends to further increase the availability of general USB-based technologies. It is maintained by the WiMedia Alliance and (as of 2009) the current revision is 1.0, which was approved in 2005.
Replacement USB RF receiver for current Air Mouse Elite and Air Mouse... Replacement USB RF receiver for current Air Mouse Elite and Air Mouse GO Plus products. Enables a 100-foot (30-meter) wireless range. Before ordering for GO Plus, confirm that your mouse unit is labeled AS04130. Consider purchasing two to ensure uninterrupted Air Mouse operation.
VicTsing MM057 2.4G Wireless Portable Mobile Mouse Optical Mice with USB Receiver, 5 Adjustable DPI Levels, 6... by VicTsing. $ 9 99 $19.99Prime. Get it by Tomorrow, Apr 21. 50% off item with purchase of 1 items. 15% off item with purchase of 1 items. See Details.
- Loss: pylate.losses.contrastive.Contrastive
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: epoch
- per_device_train_batch_size: 64
- per_device_eval_batch_size: 64
- learning_rate: 8e-06
- num_train_epochs: 4
- lr_scheduler_type: cosine
- warmup_ratio: 0.05
- fp16: True
- load_best_model_at_end: True
- push_to_hub: True
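To see how these settings interact, the sketch below derives the step counts and the learning rate at a given step. It assumes a single training device, no gradient accumulation, and the usual linear-warmup-then-cosine-decay shape; the function `learning_rate` is illustrative and is not taken from the training code.

```python
import math

TRAIN_SAMPLES = 972_246  # training set size reported in this card
BATCH_SIZE = 64          # per_device_train_batch_size (one device assumed)
EPOCHS = 4               # num_train_epochs
PEAK_LR = 8e-06          # learning_rate
WARMUP_RATIO = 0.05      # warmup_ratio

steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def learning_rate(step):
    """Linear warmup to PEAK_LR, then cosine decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the per-epoch and total step counts come out to 15192 and 60768, which agree with the step numbers listed in the training logs.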
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: epoch
- prediction_loss_only: True
- per_device_train_batch_size: 64
- per_device_eval_batch_size: 64
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 8e-06
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 4
- max_steps: -1
- lr_scheduler_type: cosine
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.05
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: True
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: True
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: True
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: proportional
Training Logs
Epoch | Step | Training Loss | Validation Loss | accuracy |
---|---|---|---|---|
1.0 | 15192 | 1.3895 | - | - |
0 | 0 | - | - | 0.7882 |
1.0 | 15192 | - | 1.0425 | - |
2.0 | 30384 | 0.9597 | - | - |
0 | 0 | - | - | 0.7972 |
2.0 | 30384 | - | 1.0071 | - |
3.0 | 45576 | 0.8756 | - | - |
0 | 0 | - | - | 0.7979 |
3.0 | 45576 | - | 1.0083 | - |
4.0 | 60768 | 0.8355 | - | - |
0 | 0 | - | - | 0.7978 |
4.0 | 60768 | - | 1.0145 | - |
0 | 0 | - | - | 0.7980 |
- The bold row denotes the saved checkpoint.
Framework Versions
- Python: 3.11.11
- Sentence Transformers: 4.0.2
- PyLate: 1.2.0
- Transformers: 4.48.2
- PyTorch: 2.6.0+cu124
- Accelerate: 1.5.2
- Datasets: 3.6.0
- Tokenizers: 0.21.1
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084"
}
PyLate
@misc{PyLate,
title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
author={Chaffin, Antoine and Sourty, Raphaël},
url={https://github.com/lightonai/pylate},
year={2024}
}