|
--- |
|
language: |
|
- en |
|
tags: |
|
- ColBERT |
|
- PyLate |
|
- sentence-transformers |
|
- sentence-similarity |
|
- feature-extraction |
|
- generated_from_trainer |
|
- dataset_size:99515 |
|
- loss:Contrastive |
|
base_model: lightonai/GTE-ModernColBERT-v1 |
|
datasets: |
|
- reasonir/reasonir-data |
|
pipeline_tag: sentence-similarity |
|
library_name: PyLate |
|
metrics: |
|
- accuracy |
|
model-index: |
|
- name: PyLate model based on lightonai/GTE-ModernColBERT-v1 |
|
results: |
|
- task: |
|
type: col-berttriplet |
|
name: Col BERTTriplet |
|
dataset: |
|
name: Unknown |
|
type: unknown |
|
metrics: |
|
- type: accuracy |
|
value: 0.9970178604125977 |
|
name: Accuracy |
|
license: cc-by-nc-4.0 |
|
--- |
|
[<img src="https://cdn-avatars.huggingface.co/v1/production/uploads/67b2f4e49edebc815a3a4739/R1g957j1aBbx8lhZbWmxw.jpeg" width="200"/>](https://huggingface.co/fjmgAI) |
|
## Fine-Tuned Model |
|
|
|
**`fjmgAI/reason-colBERT-150M-GTE-ModernColBERT`** |
|
|
|
## Base Model |
|
**`lightonai/GTE-ModernColBERT-v1`** |
|
|
|
## Fine-Tuning Method |
|
Fine-tuning was performed using **[PyLate](https://github.com/lightonai/pylate)**, with contrastive training on the [reasonir/reasonir-data](https://huggingface.co/datasets/reasonir/reasonir-data) dataset. The model maps sentences and paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity with the MaxSim operator.
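
For intuition, here is a minimal, illustrative sketch of the MaxSim scoring step (not PyLate's internal implementation): each query token embedding is matched against its most similar document token embedding, and those maxima are summed. The tensor shapes below are hypothetical.

```python
import torch

# Hypothetical token embeddings: 8 query tokens and 20 document tokens,
# each mapped to a 128-dimensional vector as produced by this model.
query_embeddings = torch.randn(8, 128)
document_embeddings = torch.randn(20, 128)

# Token-level similarity matrix of shape (num_query_tokens, num_document_tokens).
similarity = query_embeddings @ document_embeddings.T

# MaxSim: keep the best-matching document token for each query token, then sum.
score = similarity.max(dim=1).values.sum()
print(score)
```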
|
## Dataset |
|
**[`reasonir/reasonir-data`](https://huggingface.co/datasets/reasonir/reasonir-data)** |
|
|
|
### Description |
|
This English-language dataset contains **101,000 examples** of retrieval training triplets, prepared with a data preprocessing script from the BRIGHT dataset.
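
To take a quick look at the data, here is a minimal sketch using the `datasets` library; the available subsets, split names, and column layout are assumptions, so check the dataset card for the exact schema.

```python
from datasets import get_dataset_config_names, load_dataset

# List the available subsets of the dataset.
configs = get_dataset_config_names("reasonir/reasonir-data")
print(configs)

# Load one subset; the "train" split name is an assumption.
data = load_dataset("reasonir/reasonir-data", configs[0], split="train")
print(data)      # column names and number of rows
print(data[0])   # a single example
```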
|
## Fine-Tuning Details |
|
- The model was trained using **contrastive training** with PyLate's `Contrastive` loss.
|
- Evaluated with <code>pylate.evaluation.colbert_triplet.ColBERTTripletEvaluator</code> |
|
|
|
| Metric       | Value     |
|:-------------|:----------|
| **accuracy** | **0.997** |
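
For reference, below is a minimal sketch of how such a contrastive fine-tuning run can be set up with PyLate. The hyperparameters, dataset subset, and column handling are illustrative assumptions, not the exact configuration used to produce this checkpoint.

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from pylate import losses, models, utils

# Start from the base model this checkpoint was fine-tuned from.
model = models.ColBERT(model_name_or_path="lightonai/GTE-ModernColBERT-v1")

# Triplet-style training data; the subset name and columns are assumptions.
train_dataset = load_dataset("reasonir/reasonir-data", "hq", split="train")

# Illustrative hyperparameters only.
args = SentenceTransformerTrainingArguments(
    output_dir="reason-colbert-150m",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    learning_rate=3e-6,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=losses.Contrastive(model=model),
    data_collator=utils.ColBERTCollator(model.tokenize),
)
trainer.train()
```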
|
|
|
|
|
## Usage |
|
First install the PyLate library: |
|
|
|
```bash |
|
pip install -U pylate |
|
``` |
|
|
|
### Retrieval |
|
|
|
PyLate provides a streamlined interface to index and retrieve documents using ColBERT models. The index leverages the Voyager HNSW index to efficiently handle document embeddings and enable fast retrieval. |
|
|
|
#### Indexing documents |
|
|
|
First, load the ColBERT model and initialize the Voyager index, then encode and index your documents: |
|
|
|
```python |
|
import torch |
|
from pylate import indexes, models, retrieve |
|
|
|
# Step 1: Load the ColBERT model and move it to GPU if available, otherwise use CPU
model = models.ColBERT(
    model_name_or_path="fjmgAI/reason-colBERT-150M-GTE-ModernColBERT",
    trust_remote_code=True,
)
|
|
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
|
|
model.to(device) |
|
|
|
# Step 2: Initialize the Voyager index |
|
index = indexes.Voyager( |
|
index_folder="pylate-index", |
|
index_name="index", |
|
override=True, # This overwrites the existing index if any |
|
) |
|
|
|
# Step 3: Encode the documents |
|
documents_ids = ["1", "2", "3"] |
|
documents = ["document 1 text", "document 2 text", "document 3 text"] |
|
|
|
documents_embeddings = model.encode( |
|
documents, |
|
batch_size=32, |
|
is_query=False, # Ensure that it is set to False to indicate that these are documents, not queries |
|
show_progress_bar=True, |
|
) |
|
|
|
# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids |
|
index.add_documents( |
|
documents_ids=documents_ids, |
|
documents_embeddings=documents_embeddings, |
|
) |
|
``` |
|
|
|
Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it: |
|
|
|
```python |
|
# To load an index, simply instantiate it with the correct folder/name and without overriding it |
|
index = indexes.Voyager( |
|
index_folder="pylate-index", |
|
index_name="index", |
|
) |
|
``` |
|
|
|
#### Retrieving top-k documents for queries |
|
|
|
Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries. |
|
To do so, initialize the ColBERT retriever with the index you want to search in, encode the queries, and then retrieve the top-k documents to get the ids and relevance scores of the top matches:
|
|
|
```python |
|
# Step 1: Initialize the ColBERT retriever |
|
retriever = retrieve.ColBERT(index=index) |
|
|
|
# Step 2: Encode the queries |
|
queries_embeddings = model.encode( |
|
["query for document 3", "query for document 1"], |
|
batch_size=32, |
|
    is_query=True,  # Ensure that it is set to True to indicate that these are queries
|
show_progress_bar=True, |
|
) |
|
|
|
# Step 3: Retrieve top-k documents |
|
scores = retriever.retrieve( |
|
queries_embeddings=queries_embeddings, |
|
k=10, # Retrieve the top 10 matches for each query |
|
) |
|
``` |
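
The returned `scores` object holds one result list per query, in the same order as the encoded queries. Below is a minimal sketch of inspecting it; the `id` and `score` field names are an assumption, so verify them against the PyLate documentation.

```python
# One result list per query; each hit is expected to expose a document id and a relevance score.
for query_index, hits in enumerate(scores):
    print(f"Query {query_index}:")
    for hit in hits:
        print(f"  document id: {hit['id']}  score: {hit['score']:.4f}")
```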
|
|
|
### Reranking |
|
If you only want to use the ColBERT model to perform reranking on top of your first-stage retrieval pipeline without building an index, you can simply use the `rank` function and pass the queries and documents to rerank:
|
|
|
```python |
|
import torch |
|
from pylate import rank, models |
|
|
|
queries = [ |
|
"query A", |
|
"query B", |
|
] |
|
|
|
documents = [ |
|
["document A", "document B"], |
|
["document 1", "document C", "document B"], |
|
] |
|
|
|
documents_ids = [ |
|
[1, 2], |
|
[1, 3, 2], |
|
] |
|
|
|
model = models.ColBERT(
    model_name_or_path="fjmgAI/reason-colBERT-150M-GTE-ModernColBERT",
    trust_remote_code=True,
)
|
|
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
|
|
model.to(device) |
|
|
|
queries_embeddings = model.encode( |
|
queries, |
|
is_query=True, |
|
) |
|
|
|
documents_embeddings = model.encode( |
|
documents, |
|
is_query=False, |
|
) |
|
|
|
reranked_documents = rank.rerank( |
|
documents_ids=documents_ids, |
|
queries_embeddings=queries_embeddings, |
|
documents_embeddings=documents_embeddings, |
|
) |
|
``` |
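
The reranked output mirrors the input structure, with one list per query ordered by decreasing relevance. A minimal sketch of reading it, under the same field-name assumption as above:

```python
# One reranked list per query; each entry is expected to carry the document id and its score.
for query, ranking in zip(queries, reranked_documents):
    print(query)
    for entry in ranking:
        print(f"  id={entry['id']}  score={entry['score']:.4f}")
```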
|
|
|
|
|
|
|
### Framework Versions |
|
- Python: 3.10.12 |
|
- Sentence Transformers: 4.0.2 |
|
- PyLate: 1.2.0 |
|
- Transformers: 4.48.2 |
|
- PyTorch: 2.5.1+cu121 |
|
- Accelerate: 1.2.1 |
|
- Datasets: 3.3.1 |
|
- Tokenizers: 0.21.0 |
|
|
|
|
|
## Purpose |
|
This fine-tuned model is designed for scenarios that require **efficient, reasoning-aware embedding-based retrieval**, comparing embeddings at the token level with the MaxSim operation, which makes it well suited for **question answering and document retrieval**.
|
|
|
|
|
- **Developed by:** fjmgAI |
|
- **License:** |
|
Unfortunately, since the [ReasonIR data](https://huggingface.co/datasets/reasonir/reasonir-data) has been released under a cc-by-nc-4.0 license, we cannot release this model under an Apache 2.0 license. However, the authors of ReasonIR [released code to generate the data](https://github.com/facebookresearch/ReasonIR/tree/main/synthetic_data_generation). Anyone willing to reproduce the data could then easily reproduce this model under an Apache 2.0 license.
|
|
|
[<img src="https://github.com/lightonai/pylate/blob/main/docs/img/logo.png?raw=true" width="200"/>](https://github.com/lightonai/pylate) |