cadet-embed-base-v1

cadet-embed-base-v1 is a BERT-base embedding model fine-tuned from intfloat/e5-base-unsupervised with:

  • cross-encoder listwise distillation (teachers: RankT5-3B and BAAI/bge-reranker-v2.5-gemma2-lightweight)
  • purely synthetic queries (generated by Llama-3.1 8B: questions, claims, titles, keywords, and zero-shot & few-shot web queries) over a total of 400k passages from the MSMARCO, DBpedia, and Wikipedia corpora.

The result: highly effective BERT-base retrieval.

We provide our training code and scripts to generate synthetic queries at https://github.com/manveertamber/cadet-dense-retrieval.
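
For intuition, here is a minimal sketch of what a listwise distillation objective can look like: the student's similarity scores for a query's candidate passages are pushed toward the teacher cross-encoder's score distribution via KL divergence. This is an illustrative assumption (function name, temperature, and loss details are hypothetical), not the exact loss from the paper; see the repository above for the actual training code.

import torch
import torch.nn.functional as F

def listwise_distillation_loss(student_scores: torch.Tensor,
                               teacher_scores: torch.Tensor,
                               temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between teacher and student score distributions over
    each query's candidate passages (illustrative sketch only).

    student_scores, teacher_scores: shape (num_queries, num_candidates)
    """
    student_log_probs = F.log_softmax(student_scores / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_scores / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# Toy example: 2 queries, 8 candidate passages each
student = torch.randn(2, 8)
teacher = torch.randn(2, 8)
print(listwise_distillation_loss(student, teacher))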


Quick start

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("manveertamber/cadet-embed-base-v1")

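# Prefix queries with "query: " and passages with "passage: " (E5-style convention)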
query = "query: capital of France"

passages = [
    "passage: Paris is the capital and largest city of France.",
    "passage: Berlin is known for its vibrant art scene.",
    "passage: The Eiffel Tower is located in Paris, France."
]

# Encode
q_emb   = model.encode(query,    normalize_embeddings=True)
p_embs  = model.encode(passages, normalize_embeddings=True)     # shape (n_passages, dim)

scores = np.dot(p_embs, q_emb)                                  # shape (n_passages,)

# Rank passages by score
for passage, score in sorted(zip(passages, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}\t{passage}")
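
For retrieval over a larger corpus, the same embeddings can be dropped into any nearest-neighbour search. As a small sketch (not part of the original card), sentence-transformers' util.semantic_search performs exact top-k search over in-memory embeddings; with normalized embeddings, its cosine-similarity scores equal dot products:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("manveertamber/cadet-embed-base-v1")

corpus = [
    "passage: Paris is the capital and largest city of France.",
    "passage: Berlin is known for its vibrant art scene.",
    "passage: The Eiffel Tower is located in Paris, France.",
]

corpus_emb = model.encode(corpus, normalize_embeddings=True, convert_to_tensor=True)
q_emb = model.encode("query: capital of France", normalize_embeddings=True, convert_to_tensor=True)

# Exact top-k search; returns a list of hits per query
hits = util.semantic_search(q_emb, corpus_emb, top_k=3)[0]
for hit in hits:
    print(f"{hit['score']:.3f}\t{corpus[hit['corpus_id']]}")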

If you use this model, please cite:

@article{tamber2025conventionalcontrastivelearningfalls,
  title={Conventional Contrastive Learning Often Falls Short: Improving Dense Retrieval with Cross-Encoder Listwise Distillation and Synthetic Data}, 
  author={Manveer Singh Tamber and Suleman Kazi and Vivek Sourabh and Jimmy Lin},
  journal={arXiv preprint arXiv:2505.19274},
  year={2025}
}