	---
	language:
	- en
	tags:
	- ColBERT
	- PyLate
	- sentence-transformers
	- sentence-similarity
	- feature-extraction
	- generated_from_trainer
	- dataset_size:99515
	- loss:Contrastive
	base_model: lightonai/GTE-ModernColBERT-v1
	datasets:
	- reasonir/reasonir-data
	pipeline_tag: sentence-similarity
	library_name: PyLate
	metrics:
	- accuracy
	model-index:
	- name: PyLate model based on lightonai/GTE-ModernColBERT-v1
	  results:
	  - task:
	      type: col-berttriplet
	      name: Col BERTTriplet
	    dataset:
	      name: Unknown
	      type: unknown
	    metrics:
	    - type: accuracy
	      value: 0.9970178604125977
	      name: Accuracy
	license: cc-by-nc-4.0
	---
	[<img src="https://cdn-avatars.huggingface.co/v1/production/uploads/67b2f4e49edebc815a3a4739/R1g957j1aBbx8lhZbWmxw.jpeg" width="200"/>](https://huggingface.co/fjmgAI)
	## Fine-Tuned Model

	`fjmgAI/reason-colBERT-150M-GTE-ModernColBERT`

	## Base Model
	`lightonai/GTE-ModernColBERT-v1`

	## Fine-Tuning Method
	Fine-tuning was performed using [PyLate](https://github.com/lightonai/pylate), with contrastive training on the [rag-comprehensive-triplets](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets) dataset. The model maps sentences and paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.
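
To make the MaxSim operator concrete, here is a minimal pure-Python sketch of late-interaction scoring: for each query token vector, take the highest dot product against any document token vector, then sum those maxima. The function names and the 2-dimensional toy vectors are illustrative assumptions (the model itself produces 128-dimensional vectors), and this is not PyLate's implementation.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens: List[Vector], document_tokens: List[Vector]) -> float:
    """Sum, over query tokens, of the best similarity with any document token."""
    return sum(
        max(dot(q, d) for d in document_tokens)
        for q in query_tokens
    )


# Toy example: 2-dimensional vectors stand in for the 128-dimensional ones.
query = [[1.0, 0.0], [0.0, 1.0]]
document_a = [[1.0, 0.0], [0.5, 0.5]]
document_b = [[0.0, 1.0]]

score_a = maxsim_score(query, document_a)
score_b = maxsim_score(query, document_b)
print(score_a, score_b)
```

In this toy case `document_a` scores higher because it has a reasonable match for both query tokens, while `document_b` only matches one of them.
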
	## Dataset
	[`reasonir/reasonir-data`](https://huggingface.co/datasets/reasonir/reasonir-data)

	### Description
	This English-language dataset contains 101,000 examples, designed for rag-comprehensive-triplets and prepared using a data preprocessing script from the BRIGHT dataset.
	## Fine-Tuning Details
	- The model was trained with contrastive training.
	- It was evaluated with `pylate.evaluation.colbert_triplet.ColBERTTripletEvaluator`.

	| Metric   | Value |
	|:---------|:------|
	| accuracy | 0.997 |
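
As a rough illustration of the accuracy metric, the sketch below assumes it is the fraction of triplets in which the anchor scores higher against the positive than against the negative. The `triplet_accuracy` helper and the scores are hypothetical; this is not the evaluator's own code.

```python
from typing import Sequence


def triplet_accuracy(
    positive_scores: Sequence[float],
    negative_scores: Sequence[float],
) -> float:
    """Fraction of triplets where the positive outscores the negative."""
    if len(positive_scores) != len(negative_scores):
        raise ValueError("Both score lists must have the same length.")
    if not positive_scores:
        raise ValueError("At least one triplet is required.")
    correct = sum(1 for p, n in zip(positive_scores, negative_scores) if p > n)
    return correct / len(positive_scores)


# Made-up anchor/positive and anchor/negative scores for four triplets.
positive_scores = [9.1, 7.4, 3.2, 8.0]
negative_scores = [4.0, 6.9, 5.5, 2.1]

accuracy = triplet_accuracy(positive_scores, negative_scores)
print(accuracy)
```
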


	## Usage
	First install the PyLate library:

	```bash
	pip install -U pylate
	```

	### Retrieval

	PyLate provides a streamlined interface to index and retrieve documents using ColBERT models. It uses a Voyager HNSW index to handle document embeddings efficiently and enable fast retrieval.

	#### Indexing documents

	First, load the ColBERT model and initialize the Voyager index, then encode and index your documents:

	```python
	import torch
	from pylate import indexes, models, retrieve

	# Step 1: Load the ColBERT model, then move it to the GPU if available (otherwise the CPU)
	model = models.ColBERT(
	    model_name_or_path="fjmgAI/reason-colBERT-150M-GTE-ModernColBERT",
	    trust_remote_code=True,
	)

	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

	model.to(device)

	# Step 2: Initialize the Voyager index
	index = indexes.Voyager(
	    index_folder="pylate-index",
	    index_name="index",
	    override=True,  # This overwrites the existing index if any
	)

	# Step 3: Encode the documents
	documents_ids = ["1", "2", "3"]
	documents = ["document 1 text", "document 2 text", "document 3 text"]

	documents_embeddings = model.encode(
	    documents,
	    batch_size=32,
	    is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
	    show_progress_bar=True,
	)

	# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
	index.add_documents(
	    documents_ids=documents_ids,
	    documents_embeddings=documents_embeddings,
	)
	```

	Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:

	```python
	# To load an index, simply instantiate it with the correct folder/name and without overriding it
	index = indexes.Voyager(
	    index_folder="pylate-index",
	    index_name="index",
	)
	```

	#### Retrieving top-k documents for queries

	Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries.
	To do so, initialize the ColBERT retriever with the index you want to search, encode the queries, and then retrieve the top-k documents to get the ids and relevance scores of the top matches:

	```python
	# Step 1: Initialize the ColBERT retriever
	retriever = retrieve.ColBERT(index=index)

	# Step 2: Encode the queries
	queries_embeddings = model.encode(
	    ["query for document 3", "query for document 1"],
	    batch_size=32,
	    is_query=True,  # Ensure that it is set to True to indicate that these are queries
	    show_progress_bar=True,
	)

	# Step 3: Retrieve top-k documents
	scores = retriever.retrieve(
	    queries_embeddings=queries_embeddings,
	    k=10,  # Retrieve the top 10 matches for each query
	)
	```
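
The sketch below shows one way to read the retrieval output. It assumes `retriever.retrieve` returns one list per query, each holding dictionaries with `id` and `score` keys. The `sample_scores` values are made up so the snippet runs without a model or an index, and `best_match_per_query` is a hypothetical helper, not part of PyLate.

```python
from typing import Dict, List, Union

Match = Dict[str, Union[str, float]]

# Made-up stand-in for the value returned by retriever.retrieve(...).
sample_scores: List[List[Match]] = [
    [{"id": "3", "score": 11.3}, {"id": "1", "score": 10.5}],
    [{"id": "1", "score": 9.2}, {"id": "2", "score": 8.6}],
]
queries = ["query for document 3", "query for document 1"]


def best_match_per_query(
    queries: List[str],
    scores: List[List[Match]],
) -> Dict[str, str]:
    """Map each query to the id of its highest-scoring match."""
    best = {}
    for query, matches in zip(queries, scores):
        if matches:  # A query may have no matches at all.
            top = max(matches, key=lambda match: match["score"])
            best[query] = top["id"]
    return best


best = best_match_per_query(queries, sample_scores)
print(best)
```
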

	### Reranking
	If you only want to use the ColBERT model to rerank the results of your first-stage retrieval pipeline without building an index, you can use the `rank.rerank` function and pass it the queries and documents to rerank:

	```python
	import torch
	from pylate import rank, models

	queries = [
	    "query A",
	    "query B",
	]

	# One list of candidate documents per query
	documents = [
	    ["document A", "document B"],
	    ["document 1", "document C", "document B"],
	]

	# Ids aligned with the candidate documents above
	documents_ids = [
	    [1, 2],
	    [1, 3, 2],
	]

	model = models.ColBERT(
	    model_name_or_path="fjmgAI/reason-colBERT-150M-GTE-ModernColBERT",
	    trust_remote_code=True,
	)

	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

	model.to(device)

	queries_embeddings = model.encode(
	    queries,
	    is_query=True,
	)

	documents_embeddings = model.encode(
	    documents,
	    is_query=False,
	)

	reranked_documents = rank.rerank(
	    documents_ids=documents_ids,
	    queries_embeddings=queries_embeddings,
	    documents_embeddings=documents_embeddings,
	)
	```
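
Because the ids passed to the reranker are scoped to each query, mapping the output back to text needs a per-query lookup. The sketch below assumes `rank.rerank` returns, for each query, a list of dictionaries with `id` and `score` keys ordered from best to worst. `sample_reranked` is made-up data and `reranked_texts` is a hypothetical helper.

```python
from typing import Dict, List, Union

documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]
documents_ids = [
    [1, 2],
    [1, 3, 2],
]

# Made-up stand-in for the value returned by rank.rerank(...).
sample_reranked: List[List[Dict[str, Union[int, float]]]] = [
    [{"id": 2, "score": 8.1}, {"id": 1, "score": 6.4}],
    [{"id": 3, "score": 7.7}, {"id": 1, "score": 5.0}, {"id": 2, "score": 4.2}],
]


def reranked_texts(documents, documents_ids, reranked) -> List[List[str]]:
    """Return each query's candidate texts in reranked order."""
    ordered = []
    for texts, ids, matches in zip(documents, documents_ids, reranked):
        # Build the id -> text lookup separately for every query.
        text_by_id = dict(zip(ids, texts))
        ordered.append([text_by_id[match["id"]] for match in matches])
    return ordered


texts_in_order = reranked_texts(documents, documents_ids, sample_reranked)
print(texts_in_order)
```
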



	### Framework Versions
	- Python: 3.10.12
	- Sentence Transformers: 4.0.2
	- PyLate: 1.2.0
	- Transformers: 4.48.2
	- PyTorch: 2.5.1+cu121
	- Accelerate: 1.2.1
	- Datasets: 3.3.1
	- Tokenizers: 0.21.0
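
To compare a local environment against the versions listed above, a small check with the standard library's `importlib.metadata` can help. This is only a sketch: the distribution names are assumed to match the PyPI package names, `find_version_mismatches` is a hypothetical helper, and a mismatch does not necessarily mean the model will fail to load.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional, Tuple

# Versions listed above; distribution names are assumed to match PyPI.
EXPECTED_VERSIONS = {
    "sentence-transformers": "4.0.2",
    "pylate": "1.2.0",
    "transformers": "4.48.2",
    "torch": "2.5.1+cu121",
    "accelerate": "1.2.1",
    "datasets": "3.3.1",
    "tokenizers": "0.21.0",
}


def find_version_mismatches(
    expected: Dict[str, str],
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {package: (expected, installed)} for differing or missing packages."""
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None  # The package is not installed at all.
        if installed != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches


mismatches = find_version_mismatches(EXPECTED_VERSIONS)
for package, (wanted, installed) in mismatches.items():
    print(f"{package}: expected {wanted}, found {installed or 'not installed'}")
```
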


	## Purpose
	This fine-tuned model is designed for scenarios that require efficient, reasoning-oriented embedding-based retrieval. It compares query and document embeddings at the token level with the MaxSim operation, which makes it well suited to question answering and document retrieval.


	- Developed by: fjmgAI
	- License: cc-by-nc-4.0

	Unfortunately, since the [ReasonIR data](https://huggingface.co/datasets/reasonir/reasonir-data) has been released under a cc-by-nc-4.0 license, we cannot release this model under an Apache 2.0 license. However, the authors of ReasonIR [released code to generate the data](https://github.com/facebookresearch/ReasonIR/tree/main/synthetic_data_generation). Anyone willing to reproduce the data could then easily reproduce this model under an Apache 2.0 license.

	[<img src="https://github.com/lightonai/pylate/blob/main/docs/img/logo.png?raw=true" width="200"/>](https://github.com/lightonai/pylate)