|
---
tags:
- ColBERT
- PyLate
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:909188
- loss:Contrastive
base_model: EuroBERT/EuroBERT-610m
datasets:
- baconnier/rag-comprehensive-triplets
pipeline_tag: sentence-similarity
library_name: PyLate
metrics:
- accuracy
model-index:
- name: PyLate model based on EuroBERT/EuroBERT-610m
  results:
  - task:
      type: col-berttriplet
      name: Col BERTTriplet
    dataset:
      name: Unknown
      type: unknown
    metrics:
    - type: accuracy
      value: 0.9841766953468323
      name: Accuracy
license: apache-2.0
language:
- es
- en
---
|
[<img src="https://cdn-avatars.huggingface.co/v1/production/uploads/67b2f4e49edebc815a3a4739/R1g957j1aBbx8lhZbWmxw.jpeg" width="200"/>](https://huggingface.co/fjmgAI) |
|
|
|
## Fine-Tuned Model |
|
|
|
**`fjmgAI/col1-610M-EuroBERT`** |
|
|
|
## Base Model |
|
**`EuroBERT/EuroBERT-610m`** |
|
|
|
## Fine-Tuning Method |
|
Fine-tuning was performed with **[PyLate](https://github.com/lightonai/pylate)** using contrastive training on the [rag-comprehensive-triplets](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets) dataset. The resulting model maps sentences and paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity via the MaxSim operator.
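
For reference, the sketch below shows what a minimal PyLate contrastive training run of this kind can look like. It is illustrative only: the triplet column names, hyperparameters, and data splits are assumptions, not the exact configuration used for this model.

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)

from pylate import losses, models

# Initialize a PyLate ColBERT model from the EuroBERT base checkpoint
model = models.ColBERT(
    model_name_or_path="EuroBERT/EuroBERT-610m",
    trust_remote_code=True,
)

# Triplet dataset; the (query / positive / negative) column layout is assumed here
train_dataset = load_dataset("baconnier/rag-comprehensive-triplets", split="train")

# Contrastive loss over ColBERT's late-interaction (MaxSim) scores
train_loss = losses.Contrastive(model=model)

args = SentenceTransformerTrainingArguments(
    output_dir="col1-610M-EuroBERT",
    num_train_epochs=1,               # illustrative hyperparameters
    per_device_train_batch_size=32,
    learning_rate=3e-6,
    bf16=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
)
trainer.train()
```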
|
|
|
## Dataset |
|
**[`baconnier/rag-comprehensive-triplets`](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets)** |
|
|
|
### Description |
|
The dataset was filtered to its Spanish-language subset, yielding **303,000 examples** of comprehensive RAG triplets (query, positive document, negative document) used for contrastive training.
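
To reproduce a similar Spanish subset, a minimal sketch with the `datasets` library is shown below; the `language` column name is a guess, so check the dataset card for the actual field.

```python
from datasets import load_dataset

# Load the full triplet dataset
dataset = load_dataset("baconnier/rag-comprehensive-triplets", split="train")

# Keep only Spanish rows. "language" is a hypothetical column name:
# inspect dataset.column_names to find the field that stores the language.
spanish_subset = dataset.filter(lambda row: row["language"] == "es")
print(len(spanish_subset))
```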
|
|
|
## Fine-Tuning Details |
|
- The model was trained with **contrastive training** (PyLate's `Contrastive` loss).

- Evaluated with `pylate.evaluation.colbert_triplet.ColBERTTripletEvaluator` (a brief usage sketch follows the results table below).
|
|
|
| Metric       | Value       |
|:-------------|:------------|
| **accuracy** | **0.98417** |
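
As a brief usage sketch, the evaluator can be run on held-out triplets as follows; the triplets below are placeholders, not the actual evaluation split.

```python
from pylate import evaluation, models

model = models.ColBERT("fjmgAI/col1-610M-EuroBERT", trust_remote_code=True)

# Placeholder held-out triplets (query, relevant document, irrelevant document)
anchors = ["¿Cuál es la capital de España?"]
positives = ["La capital de España es Madrid."]
negatives = ["Florida es un estado en los Estados Unidos."]

evaluator = evaluation.ColBERTTripletEvaluator(
    anchors=anchors,
    positives=positives,
    negatives=negatives,
)

# Returns metrics including the triplet accuracy: the fraction of triplets
# where the positive document scores higher than the negative one
results = evaluator(model)
print(results)
```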
|
|
|
## Usage |
|
First install the PyLate library: |
|
|
|
```bash
pip install -U pylate
```
|
|
|
### Calculate Similarity |
|
|
|
```python
import torch

from pylate import models

# Load the ColBERT model
model = models.ColBERT("fjmgAI/col1-610M-EuroBERT", trust_remote_code=True)

# Move the model to GPU if available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Example data for similarity comparison
query = "¿Cuál es la capital de España?"  # Query sentence
positive_doc = "La capital de España es Madrid."  # Relevant document
negative_doc = "Florida es un estado en los Estados Unidos."  # Irrelevant document
sentences = [query, positive_doc, negative_doc]  # Combine all texts

# Tokenize the input sentences using ColBERT's tokenizer
inputs = model.tokenize(sentences)

# Move all input tensors to the same device as the model (GPU/CPU)
inputs = {key: value.to(device) for key, value in inputs.items()}

# Generate token embeddings (no gradients needed for inference)
with torch.no_grad():
    embeddings_dict = model(inputs)
    embeddings = embeddings_dict["token_embeddings"]


# Define ColBERT's MaxSim similarity function
def colbert_similarity(query_emb, doc_emb):
    """
    Computes ColBERT-style similarity between query and document embeddings.
    Uses maximum similarity (MaxSim) between individual tokens.

    Args:
        query_emb: [query_tokens, embedding_dim]
        doc_emb: [doc_tokens, embedding_dim]

    Returns:
        Normalized similarity score
    """
    # Compute dot product between all token pairs
    similarity_matrix = torch.matmul(query_emb, doc_emb.T)

    # Get maximum similarity for each query token (MaxSim)
    max_similarities = similarity_matrix.max(dim=1)[0]

    # Return average of maximum similarities (normalized by query length)
    return max_similarities.sum() / query_emb.shape[0]


# Extract embeddings for each text
query_emb = embeddings[0]
positive_emb = embeddings[1]
negative_emb = embeddings[2]

# Compute similarity scores
positive_score = colbert_similarity(query_emb, positive_emb)
negative_score = colbert_similarity(query_emb, negative_emb)

print(f"Similarity with positive document: {positive_score.item():.4f}")
print(f"Similarity with negative document: {negative_score.item():.4f}")
```
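
Alternatively, PyLate exposes a higher-level encode/rerank API that applies the same MaxSim scoring without writing it by hand. A minimal sketch follows; the document IDs are arbitrary labels.

```python
from pylate import models, rank

model = models.ColBERT("fjmgAI/col1-610M-EuroBERT", trust_remote_code=True)

queries = ["¿Cuál es la capital de España?"]
documents = [[
    "La capital de España es Madrid.",
    "Florida es un estado en los Estados Unidos.",
]]
documents_ids = [["doc_es_madrid", "doc_us_florida"]]

# Queries and documents are encoded differently (ColBERT prepends distinct markers)
queries_embeddings = model.encode(queries, is_query=True)
documents_embeddings = model.encode(documents, is_query=False)

# Rerank each query's candidate documents with the MaxSim operator
reranked = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
print(reranked)
```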
|
|
|
## Framework Versions |
|
- Python: 3.10.12 |
|
- Sentence Transformers: 3.4.1 |
|
- PyLate: 1.1.7 |
|
- Transformers: 4.48.2 |
|
- PyTorch: 2.5.1+cu121 |
|
- Accelerate: 1.2.1 |
|
- Datasets: 3.3.1 |
|
- Tokenizers: 0.21.0 |
|
|
|
## Purpose |
|
This fine-tuned model is designed for **Spanish-language applications** that require **efficient semantic search**, comparing query and document embeddings at the token level with the MaxSim operation, which makes it well suited for **question answering and document retrieval**.
|
|
|
|
|
- **Developed by:** fjmgAI |
|
- **License:** apache-2.0 |
|
|
|
[<img src="https://github.com/lightonai/pylate/blob/main/docs/img/logo.png?raw=true" width="200"/>](https://github.com/lightonai/pylate) |