---
tags:
- ColBERT
- PyLate
- contextual-embeddings
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
---
# ModernColBERT + InSeNT

This is a contextual embedding model fine-tuned from [lightonai/GTE-ModernColBERT-v1](https://huggingface.co/lightonai/GTE-ModernColBERT-v1) on the ConTEB training dataset, using the InSeNT training approach detailed in the corresponding paper.

This experimental model stems from the paper *Context is Gold to find the Gold Passage: Evaluating and Training Contextual Document Embeddings* ([arXiv:2505.24782](https://arxiv.org/abs/2505.24782)). While results are promising, we have observed regressions on standard embedding tasks, so using it in production will likely require further work, in particular extending the training set to improve robustness and out-of-distribution (OOD) generalization.
## Usage

### Direct Usage
First, install the [contextual-embeddings](https://github.com/illuin-tech/contextual-embeddings) package:

```bash
pip install git+https://github.com/illuin-tech/contextual-embeddings
```
To run inference with a contextual model, you can use the following example:

```python
from contextual_embeddings import LongContextEmbeddingModel
from pylate.models import ColBERT

# Each document is a list of chunks; chunks are embedded jointly,
# in the context of their document.
documents = [
    [
        "The old lighthouse keeper trimmed his lamp, its beam cutting a lonely path through the fog.",
        "He remembered nights of violent storms, when the ocean seemed to swallow the sky whole.",
        "Still, he found comfort in his duty, a silent guardian against the treacherous sea.",
    ],
    [
        "A curious fox cub, all rust and wonder, ventured out from its den for the first time.",
        "Each rustle of leaves, every chirping bird, was a new symphony to its tiny ears.",
        "Under the watchful eye of its mother, it began to learn the secrets of the whispering forest.",
    ],
]

base_model = ColBERT("illuin-conteb/modern-colbert-insent")
contextual_model = LongContextEmbeddingModel(
    base_model=base_model,
    pooling_mode="tokens",
)

# Returns one list of chunk embeddings (one tensor of token vectors per chunk)
# for each input document.
embeddings = contextual_model.embed_documents(documents)

print("Number of documents:", len(embeddings))  # 2
print("Number of chunks in first document:", len(embeddings[0]))  # 3
print(f"Shape of first chunk embedding: {embeddings[0][0].shape}")  # torch.Size([22, 128])
```
## Model Details

### Model Description

- Model Type: Sentence Transformer
- Base model: lightonai/GTE-ModernColBERT-v1
- Maximum Sequence Length: 8192 tokens
- Output Dimensionality: 128 dimensions
- Similarity Function: MaxSim (see the formula below)
- Training Dataset:
  - train
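The MaxSim score used here is the standard ColBERT late-interaction score (not specific to this model): given token embedding matrices $E_q$ for a query and $E_d$ for a chunk,

$$
\mathrm{MaxSim}(q, d) = \sum_{i=1}^{|q|} \max_{1 \le j \le |d|} E_{q,i} \cdot E_{d,j}
$$

This is what the `maxsim` helper in the usage sketch above computes.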
### Model Sources

- Repository: [Contextual Embeddings](https://github.com/illuin-tech/contextual-embeddings)
- Hugging Face: Contextual Embeddings
### Full Model Architecture

```
ColBERT(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel
  (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
```
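Concretely, ModernBERT produces 768-dimensional token embeddings, and the bias-free `Dense` layer with identity activation projects each token down to a 128-dimensional ColBERT vector. A minimal sketch of that projection head, with shapes taken from the printout above:

```python
import torch

# Bias-free 768 -> 128 linear projection with identity activation,
# mirroring the Dense module in the architecture printout.
projection = torch.nn.Linear(768, 128, bias=False)

token_embeddings = torch.randn(22, 768)  # e.g. one 22-token chunk from ModernBERT
colbert_vectors = projection(token_embeddings)
print(colbert_vectors.shape)  # torch.Size([22, 128])
```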
## Citation

```bibtex
@misc{conti2025contextgoldgoldpassage,
  title={Context is Gold to find the Gold Passage: Evaluating and Training Contextual Document Embeddings},
  author={Max Conti and Manuel Faysse and Gautier Viaud and Antoine Bosselut and Céline Hudelot and Pierre Colombo},
  year={2025},
  eprint={2505.24782},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2505.24782},
}
```