ModernColBERT + InSeNT

This is a contextual embedding model fine-tuned from lightonai/GTE-ModernColBERT-v1 on the ConTEB training dataset, using the InSeNT training approach detailed in the corresponding paper.

This experimental model stems from the paper Context is Gold to find the Gold Passage: Evaluating and Training Contextual Document Embeddings. While its results are promising, we have observed regressions on standard embedding tasks, and using it in production will likely require extending the training set to improve robustness and out-of-distribution (OOD) generalization.

Usage

Direct Usage

First install the contextual-embeddings package:

pip install git+https://github.com/illuin-tech/contextual-embeddings

To run inference with a contextual model, you can use the following example:

from contextual_embeddings import LongContextEmbeddingModel
from pylate.models import ColBERT

documents = [
    [
        "The old lighthouse keeper trimmed his lamp, its beam cutting a lonely path through the fog.",
        "He remembered nights of violent storms, when the ocean seemed to swallow the sky whole.",
        "Still, he found comfort in his duty, a silent guardian against the treacherous sea."
    ],
    [
        "A curious fox cub, all rust and wonder, ventured out from its den for the first time.",
        "Each rustle of leaves, every chirping bird, was a new symphony to its tiny ears.",
        "Under the watchful eye of its mother, it began to learn the secrets of the whispering forest."
    ]
]
# Load the fine-tuned late-interaction model from the Hub.
base_model = ColBERT("illuin-conteb/modern-colbert-insent")

# Wrap it so all chunks of a document are embedded with shared context;
# pooling_mode="tokens" keeps per-token (ColBERT-style) chunk embeddings.
contextual_model = LongContextEmbeddingModel(
    base_model=base_model,
    pooling_mode="tokens"
)

# One embedding per chunk, grouped per document.
embeddings = contextual_model.embed_documents(documents)
print("Length of embeddings:", len(embeddings)) # 2
print("Length of first document embedding:", len(embeddings[0])) # 3
print(f"Shape of first chunk embedding: {embeddings[0][0].shape}") # torch.Size([22, 128])

Model Details

Model Description

  • Model Type: ColBERT Sentence Transformer
  • Base model: lightonai/GTE-ModernColBERT-v1
  • Maximum Sequence Length: 8192 tokens (see the introspection sketch below)
  • Output Dimensionality: 128 dimensions
  • Similarity Function: MaxSim
  • Model Size: 149M parameters (BF16, Safetensors)
  • Training Dataset: ConTEB (train split)
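A minimal sketch to check these values programmatically, assuming PyLate's ColBERT keeps the standard sentence-transformers introspection helpers:

from pylate.models import ColBERT

model = ColBERT("illuin-conteb/modern-colbert-insent")
print(model.max_seq_length)                      # 8192
print(model.get_sentence_embedding_dimension())  # 128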

Model Sources

  • Paper: https://arxiv.org/abs/2505.24782
  • Repository: https://github.com/illuin-tech/contextual-embeddings

Full Model Architecture

ColBERT(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel 
  (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
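In other words, ModernBERT produces one 768-dimensional hidden state per token, and a single bias-free linear layer projects each of them down to the 128-dimensional token embeddings that MaxSim operates on. A minimal stand-alone sketch of that projection, using random stand-in values:

import torch

# Stand-in for ModernBERT output: one 768-dim hidden state per token.
hidden_states = torch.randn(1, 22, 768)        # (batch, tokens, hidden)

# The Dense module above: 768 -> 128, no bias, identity activation.
projection = torch.nn.Linear(768, 128, bias=False)

token_embeddings = projection(hidden_states)   # (batch, tokens, 128)
print(token_embeddings.shape)                  # torch.Size([1, 22, 128])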

Citation

@misc{conti2025contextgoldgoldpassage,
      title={Context is Gold to find the Gold Passage: Evaluating and Training Contextual Document Embeddings}, 
      author={Max Conti and Manuel Faysse and Gautier Viaud and Antoine Bosselut and Céline Hudelot and Pierre Colombo},
      year={2025},
      eprint={2505.24782},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2505.24782}, 
}