
Introduction
mdbr-leaf-ir is a compact, high-performance text embedding model specifically designed for information retrieval (IR) tasks, e.g., the retrieval stage of Retrieval-Augmented Generation (RAG) pipelines.
To enable even greater efficiency, mdbr-leaf-ir supports flexible asymmetric architectures and is robust to vector quantization and MRL truncation.
If you are looking to perform other tasks such as classification, clustering, semantic sentence similarity, or summarization, please check out our mdbr-leaf-mt model.
Note: this model was developed by the ML team of MongoDB Research. At the time of writing, it is not used in any of MongoDB's commercial products or service offerings.
Technical Report
A technical report detailing our proposed LEAF training procedure will be available soon (a link will be added here).
Highlights
- State-of-the-Art Performance: mdbr-leaf-ir achieves state-of-the-art results for compact embedding models, ranking #1 on the public BEIR benchmark leaderboard for models with ≤100M parameters.
- Flexible Architecture Support: mdbr-leaf-ir supports asymmetric retrieval architectures, enabling even better retrieval results. See below for more information.
- MRL and Quantization Support: embedding vectors generated by mdbr-leaf-ir compress well when truncated (MRL) and can be stored using more efficient types like int8 and binary. See below for more information.
Benchmark Comparison
The table below shows the average BEIR benchmark scores (nDCG@10) for mdbr-leaf-ir compared to other retrieval models. mdbr-leaf-ir ranks #1 on the BEIR public leaderboard, and when run in asymmetric "(asym.)" mode, as described here, the results improve even further.
Model | Size | BEIR Avg. (nDCG@10) |
---|---|---|
OpenAI text-embedding-3-large | Unknown | 55.43 |
mdbr-leaf-ir (asym.) | 23M | 54.03 |
mdbr-leaf-ir | 23M | 53.55 |
snowflake-arctic-embed-s | 32M | 51.98 |
bge-small-en-v1.5 | 33M | 51.65 |
OpenAI text-embedding-3-small | Unknown | 51.08 |
granite-embedding-small-english-r2 | 47M | 50.87 |
snowflake-arctic-embed-xs | 23M | 50.15 |
e5-small-v2 | 33M | 49.04 |
SPLADE++ | 110M | 48.88 |
MiniLM-L6-v2 | 23M | 41.95 |
BM25 | – | 41.14 |
Quickstart
Sentence Transformers
from sentence_transformers import SentenceTransformer
# Load the model
model = SentenceTransformer("MongoDB/mdbr-leaf-ir")
# Example queries and documents
queries = [
"What is machine learning?",
"How does neural network training work?"
]
documents = [
"Machine learning is a subset of artificial intelligence that focuses on algorithms that can learn from data.",
"Neural networks are trained through backpropagation, adjusting weights to minimize prediction errors."
]
# Encode queries and documents
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)
# Compute similarity scores
scores = model.similarity(query_embeddings, document_embeddings)
# Print results
for i, query in enumerate(queries):
    print(f"Query: {query}")
    for j, doc in enumerate(documents):
        print(f"  Similarity: {scores[i, j]:.4f} | Document {j}: {doc[:80]}...")
# Query: What is machine learning?
# Similarity: 0.6857 | Document 0: Machine learning is a subset of ...
# Similarity: 0.4598 | Document 1: Neural networks are trained ...
#
# Query: How does neural network training work?
# Similarity: 0.4238 | Document 0: Machine learning is a subset of ...
# Similarity: 0.5723 | Document 1: Neural networks are trained ...
Transformers Usage
See the full example notebook here.
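For reference, below is a minimal sketch of how the model could be called through the plain transformers API. The CLS pooling and the query prefix are assumptions carried over from the teacher model and may not match this model exactly; the linked notebook is the authoritative reference.

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MongoDB/mdbr-leaf-ir")
model = AutoModel.from_pretrained("MongoDB/mdbr-leaf-ir")
model.eval()

# Assumption: same query prefix as the teacher model; verify against the notebook
query_prefix = "Represent this sentence for searching relevant passages: "
queries = [query_prefix + "What is machine learning?"]
documents = ["Machine learning is a subset of artificial intelligence that focuses on algorithms that can learn from data."]

with torch.no_grad():
    q_tok = tokenizer(queries, padding=True, truncation=True, return_tensors="pt")
    d_tok = tokenizer(documents, padding=True, truncation=True, return_tensors="pt")
    # Assumption: CLS pooling followed by L2 normalization
    q_emb = F.normalize(model(**q_tok).last_hidden_state[:, 0], dim=-1)
    d_emb = F.normalize(model(**d_tok).last_hidden_state[:, 0], dim=-1)

scores = q_emb @ d_emb.T
print(scores)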
Asymmetric Retrieval Setup
mdbr-leaf-ir is aligned to snowflake-arctic-embed-m-v1.5, the model it has been distilled from. This enables flexible architectures in which, for example, documents are encoded with the larger model, while queries are encoded faster and more efficiently with the compact leaf model:
# Use mdbr-leaf-ir for query encoding (real-time, low latency)
query_model = SentenceTransformer("MongoDB/mdbr-leaf-ir")
query_embeddings = query_model.encode(queries, prompt_name="query")
# Use a larger model for document encoding (one-time, at index time)
doc_model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m-v1.5")
document_embeddings = doc_model.encode(documents)
# Compute similarities
scores = query_model.similarity(query_embeddings, document_embeddings)
Retrieval results in asymmetric mode are often superior to those of the standard mode shown above.
MRL Truncation
Embeddings have been trained via MRL and can be truncated for more efficient storage:
from torch.nn import functional as F
query_embeds = model.encode(queries, prompt_name="query", convert_to_tensor=True)
doc_embeds = model.encode(documents, convert_to_tensor=True)
# Truncate and normalize according to MRL
query_embeds = F.normalize(query_embeds[:, :256], dim=-1)
doc_embeds = F.normalize(doc_embeds[:, :256], dim=-1)
similarities = model.similarity(query_embeds, doc_embeds)
print('After MRL:')
print(f"* Embeddings dimension: {query_embeds.shape[1]}")
print(f"* Similarities:\n\t{similarities}")
# After MRL:
# * Embeddings dimension: 256
# * Similarities:
# tensor([[0.7136, 0.4989],
# [0.4567, 0.6022]])
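Alternatively, recent sentence-transformers releases allow the truncation dimension to be set once at load time. The sketch below assumes such a version is installed and that cosine similarity is used, in which case the skipped re-normalization does not affect the ranking:

# Truncate all embeddings to 256 dimensions directly at encoding time
model_256 = SentenceTransformer("MongoDB/mdbr-leaf-ir", truncate_dim=256)
query_embeds = model_256.encode(queries, prompt_name="query", convert_to_tensor=True)
doc_embeds = model_256.encode(documents, convert_to_tensor=True)
print(model_256.similarity(query_embeds, doc_embeds))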
Vector Quantization
Vector quantization, for example to int8 or binary, can be performed as follows:
Note: for vector quantization to types other than binary, we suggest performing a calibration to determine the optimal ranges, see here. Good initial values, according to the teacher model's documentation, are:
- int8: -0.3 and +0.3
- int4: -0.18 and +0.18
from sentence_transformers.quantization import quantize_embeddings
import torch
query_embeds = model.encode(queries, prompt_name="query")
doc_embeds = model.encode(documents)
# Quantize embeddings to int8 using -0.3 and +0.3 as calibration ranges
ranges = torch.tensor([[-0.3], [+0.3]]).expand(2, query_embeds.shape[1]).cpu().numpy()
query_embeds = quantize_embeddings(query_embeds, "int8", ranges=ranges)
doc_embeds = quantize_embeddings(doc_embeds, "int8", ranges=ranges)
# Calculate similarities; cast to int64 to avoid under/overflow
similarities = query_embeds.astype(int) @ doc_embeds.astype(int).T
print('After quantization:')
print(f"* Embeddings type: {query_embeds.dtype}")
print(f"* Similarities:\n{similarities}")
# After quantization:
# * Embeddings type: int8
# * Similarities:
# [[118022 79111]
# [ 72961 98333]]
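For binary quantization, which needs no calibration ranges, a possible sketch is shown below: each dimension is packed into a single bit and documents are ranked by Hamming similarity. The "ubinary" precision and the scoring scheme are assumptions based on the sentence-transformers quantization utilities, not part of this model card.

import numpy as np
from sentence_transformers.quantization import quantize_embeddings

query_embeds = model.encode(queries, prompt_name="query")
doc_embeds = model.encode(documents)
dim = query_embeds.shape[1]

# Pack each dimension into one bit (uint8-packed, dim // 8 bytes per vector)
query_bits = quantize_embeddings(query_embeds, "ubinary")
doc_bits = quantize_embeddings(doc_embeds, "ubinary")

# Hamming similarity: number of bit positions on which query and document agree
xor = np.bitwise_xor(query_bits[:, None, :], doc_bits[None, :, :])
hamming_distance = np.unpackbits(xor, axis=-1).sum(axis=-1)
print(dim - hamming_distance)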
Evaluation
Please see here.
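As an illustration, a single BEIR task can be run with the mteb package roughly as follows. This is a sketch assuming a recent mteb version; the linked evaluation instructions are authoritative.

import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("MongoDB/mdbr-leaf-ir")

# Run one BEIR retrieval task (SciFact) and write results to disk
tasks = mteb.get_tasks(tasks=["SciFact"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/mdbr-leaf-ir")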
Citation
If you use this model in your work, please cite:
@article{mdb_leaf,
  title = {LEAF: Lightweight Embedding Alignment Knowledge Distillation Framework},
  author = {Robin Vujanic and Thomas Rueckstiess},
  year = {2025},
  eprint = {TBD},
  archiveprefix = {arXiv},
  primaryclass = {FILL HERE},
  url = {FILL HERE}
}
License
This model is released under the Apache 2.0 license.
Contact
For questions or issues, please open an issue or pull request. You can also contact the MongoDB ML research team at [email protected].