ModernBERT-embed-large
ModernBERT-embed-large is an embedding model trained from ModernBERT-large, bringing the new advances of ModernBERT to embeddings!
ModernBERT itself is a base model trained for Masked Language Modeling and cannot be used directly for tasks such as retrieval without further fine-tuning.
ModernBERT-embed-large is fine-tuned on the Nomic Embed weakly-supervised and supervised datasets. It also supports Matryoshka Representation Learning, so its 1024-dimensional embeddings can be truncated to 256 dimensions to reduce memory with minimal performance loss.
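To make the memory saving concrete, here is a rough back-of-the-envelope sketch. It assumes embeddings are stored as raw float32 values (4 bytes each) and ignores any index overhead; the corpus size and the helper name `index_size_bytes` are hypothetical.

```python
def index_size_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw storage for dense embeddings, ignoring any index overhead."""
    return num_vectors * dim * bytes_per_value


num_docs = 1_000_000  # hypothetical corpus size, chosen only for illustration

full = index_size_bytes(num_docs, 1024)      # full-size embeddings
truncated = index_size_bytes(num_docs, 256)  # Matryoshka-truncated embeddings

print(f"1024 dims: {full / 1e9:.2f} GB")       # 1024 dims: 4.10 GB
print(f"256 dims: {truncated / 1e9:.2f} GB")   # 256 dims: 1.02 GB
print(f"reduction: {full / truncated:.0f}x")   # reduction: 4x
```

The 4x factor follows directly from the ratio 1024 / 256; the quality cost of that truncation is what the 256-dimension rows of the performance table report.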
Performance
Model | Dimensions | Average (56) | Classification (12) | Clustering (11) | Pair Classification (3) | Reranking (4) | Retrieval (15) | STS (10) | Summarization (1) |
---|---|---|---|---|---|---|---|---|---|
nomic-embed-text-v1.5 | 768 | 62.28 | 73.55 | 43.93 | 84.61 | 55.78 | 53.01 | 81.94 | 30.4 |
modernbert-embed-base | 768 | 62.62 | 74.31 | 44.98 | 83.96 | 56.42 | 52.89 | 81.78 | 31.39 |
modernbert-embed-large | 1024 | 63.84 | 75.03 | 46.04 | 85.31 | 57.64 | 54.36 | 83.80 | 28.31 |
nomic-embed-text-v1.5 | 256 | 61.04 | 72.1 | 43.16 | 84.09 | 55.18 | 50.81 | 81.34 | 30.05 |
modernbert-embed-base | 256 | 61.17 | 72.40 | 43.82 | 83.45 | 55.69 | 50.62 | 81.12 | 31.27 |
modernbert-embed-large | 256 | 62.43 | 73.60 | 44.59 | 84.89 | 57.08 | 51.72 | 83.46 | 29.03 |
Usage
You can use these models directly with the transformers library, which requires transformers>=4.48.0:
pip install "transformers>=4.48.0"
Reminder, this model is trained similarly to Nomic Embed and REQUIRES prefixes to be added to the input. For more information, see the instructions in Nomic Embed.
For most use cases, adding search_query: to the query and search_document: to the documents will be sufficient.
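Since forgetting the prefix is an easy mistake to make, one option is to centralize it in a small helper. The sketch below is only illustrative: `PREFIXES` and `add_prefix` are hypothetical names, not part of any library, and only the two prefixes mentioned above are covered.

```python
from typing import Iterable, List

# Hypothetical helper, not part of sentence-transformers or transformers.
PREFIXES = {
    "query": "search_query: ",
    "document": "search_document: ",
}


def add_prefix(texts: Iterable[str], kind: str) -> List[str]:
    """Prepend the task prefix to each text, skipping texts that already have it."""
    if kind not in PREFIXES:
        raise ValueError(f"unknown kind {kind!r}, expected one of {sorted(PREFIXES)}")
    prefix = PREFIXES[kind]
    return [t if t.startswith(prefix) else prefix + t for t in texts]


queries = add_prefix(["What is TSNE?"], "query")
documents = add_prefix(["TSNE is a dimensionality reduction algorithm"], "document")
print(queries)    # ['search_query: What is TSNE?']
print(documents)  # ['search_document: TSNE is a dimensionality reduction algorithm']
```

The resulting lists can be passed to the encoding calls in the sections below in place of hand-written prefixed strings.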
Sentence Transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("lightonai/modernbert-embed-large")
query_embeddings = model.encode([
"search_query: What is TSNE?",
"search_query: Who is Laurens van der Maaten?",
])
doc_embeddings = model.encode([
"search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten",
])
print(query_embeddings.shape, doc_embeddings.shape)
# (2, 1024) (1, 1024)
similarities = model.similarity(query_embeddings, doc_embeddings)
print(similarities)
# tensor([[0.6518],
# [0.4237]])
Sentence Transformers usage with Matryoshka Truncation
In Sentence Transformers, you can truncate embeddings to a smaller dimension by using the truncate_dim parameter when loading the SentenceTransformer model.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("lightonai/modernbert-embed-large", truncate_dim=256)
query_embeddings = model.encode([
"search_query: What is TSNE?",
"search_query: Who is Laurens van der Maaten?",
])
doc_embeddings = model.encode([
"search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten",
])
print(query_embeddings.shape, doc_embeddings.shape)
# (2, 256) (1, 256)
similarities = model.similarity(query_embeddings, doc_embeddings)
print(similarities)
# tensor([[0.6835],
# [0.3982]])
Note the small differences compared to the full 1024-dimensional similarities.
Transformers
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )
queries = ["search_query: What is TSNE?", "search_query: Who is Laurens van der Maaten?"]
documents = ["search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten"]
tokenizer = AutoTokenizer.from_pretrained("lightonai/modernbert-embed-large")
model = AutoModel.from_pretrained("lightonai/modernbert-embed-large")
encoded_queries = tokenizer(queries, padding=True, truncation=True, return_tensors="pt")
encoded_documents = tokenizer(documents, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    queries_outputs = model(**encoded_queries)
    documents_outputs = model(**encoded_documents)
query_embeddings = mean_pooling(queries_outputs, encoded_queries["attention_mask"])
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
doc_embeddings = mean_pooling(documents_outputs, encoded_documents["attention_mask"])
doc_embeddings = F.normalize(doc_embeddings, p=2, dim=1)
print(query_embeddings.shape, doc_embeddings.shape)
# torch.Size([2, 1024]) torch.Size([1, 1024])
similarities = query_embeddings @ doc_embeddings.T
print(similarities)
# tensor([[0.6518],
# [0.4237]])
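The mean_pooling function above averages token vectors while ignoring padding positions. As a minimal illustration of that idea, here is a pure-Python version operating on a single toy sequence; `masked_mean` and the toy values are made up for this sketch and do not come from the model.

```python
def masked_mean(token_vectors, mask):
    """Average the token vectors whose mask entry is 1, ignoring padded positions."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    # Same guard as torch.clamp(..., min=1e-9): avoids division by zero
    # when every position is masked out.
    denom = max(count, 1e-9)
    return [t / denom for t in total]


tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last row plays the role of padding
mask = [1, 1, 0]
print(masked_mean(tokens, mask))  # [2.0, 3.0]
```

Without the mask, the padding row would dominate the average, which is why the attention mask is passed to the pooling step.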
Transformers usage with Matryoshka Truncation
In transformers, you can truncate embeddings to a smaller dimension by slicing the mean pooled embeddings, prior to normalization.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )
queries = ["search_query: What is TSNE?", "search_query: Who is Laurens van der Maaten?"]
documents = ["search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten"]
tokenizer = AutoTokenizer.from_pretrained("lightonai/modernbert-embed-large")
model = AutoModel.from_pretrained("lightonai/modernbert-embed-large")
truncate_dim = 256
encoded_queries = tokenizer(queries, padding=True, truncation=True, return_tensors="pt")
encoded_documents = tokenizer(documents, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    queries_outputs = model(**encoded_queries)
    documents_outputs = model(**encoded_documents)
query_embeddings = mean_pooling(queries_outputs, encoded_queries["attention_mask"])
query_embeddings = query_embeddings[:, :truncate_dim]
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
doc_embeddings = mean_pooling(documents_outputs, encoded_documents["attention_mask"])
doc_embeddings = doc_embeddings[:, :truncate_dim]
doc_embeddings = F.normalize(doc_embeddings, p=2, dim=1)
print(query_embeddings.shape, doc_embeddings.shape)
# torch.Size([2, 256]) torch.Size([1, 256])
similarities = query_embeddings @ doc_embeddings.T
print(similarities)
# tensor([[0.6835],
# [0.3982]])
Note the small differences compared to the full 1024-dimensional similarities.
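A practical consequence of slicing before normalizing is that embeddings already stored at full size and L2-normalized can also be truncated later: L2 normalization only rescales a vector by a positive constant, so slicing and re-normalizing gives the same direction. The sketch below checks this on a toy vector in pure Python. It assumes nothing else happens between pooling and normalization, as in the example above; the helper names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def truncate_and_renormalize(vec, dim):
    """Keep the first `dim` components, then re-normalize."""
    return l2_normalize(vec[:dim])


pooled = [3.0, 4.0, 12.0, 0.5]  # toy stand-in for a mean pooled embedding
keep = 2                        # toy stand-in for truncate_dim

# Path A: slice the pooled vector, then normalize (as in the example above).
path_a = l2_normalize(pooled[:keep])

# Path B: normalize at full size first, then slice and re-normalize.
path_b = truncate_and_renormalize(l2_normalize(pooled), keep)

print(path_a)  # [0.6, 0.8]
print(all(math.isclose(a, b) for a, b in zip(path_a, path_b)))  # True
```

This means a single full-size index can serve several truncation sizes, at the cost of one extra normalization per vector.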
Transformers.js
If you haven't already, you can install the Transformers.js JavaScript library from NPM using:
npm i @huggingface/transformers
Then, you can compute embeddings as follows:
import { pipeline, matmul } from '@huggingface/transformers';
// Create a feature extraction pipeline
const extractor = await pipeline(
"feature-extraction",
"lightonai/modernbert-embed-large",
{ dtype: "fp32" }, // Supported options: "fp32", "fp16", "q8", "q4", "q4f16"
);
// Embed queries and documents
const query_embeddings = await extractor([
"search_query: What is TSNE?",
"search_query: Who is Laurens van der Maaten?",
], { pooling: "mean", normalize: true },
);
const doc_embeddings = await extractor([
"search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten",
], { pooling: "mean", normalize: true },
);
// Compute similarity scores
const similarities = await matmul(query_embeddings, doc_embeddings.transpose(1, 0));
console.log(similarities.tolist());
Training
We train ModernBERT-embed-large using a multi-stage training pipeline. Starting from the pretrained ModernBERT-large model, the first unsupervised contrastive stage trains on a dataset generated from weakly related text pairs, such as question-answer pairs from forums like StackExchange and Quora, title-body pairs from Amazon reviews, and summarizations from news articles.
In the second finetuning stage, higher quality labeled datasets such as search queries and answers from web searches are leveraged. Data curation and hard-example mining are crucial in this stage.
For more details, see the Nomic Embed Technical Report and corresponding blog post.
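For readers unfamiliar with contrastive training on text pairs, the sketch below shows an InfoNCE-style loss with in-batch negatives, a common choice for this kind of training. It is an illustration under assumptions and is not taken from the training code: the function name and the temperature value are hypothetical, and the inputs are assumed to be L2-normalized so that dot products act as cosine similarities.

```python
import math


def info_nce(query_vecs, doc_vecs, temperature=0.05):
    """Mean cross-entropy where doc i is the positive for query i and
    every other document in the batch serves as a negative."""
    losses = []
    for i, q in enumerate(query_vecs):
        # Similarity of this query to every document in the batch.
        logits = [sum(a * b for a, b in zip(q, d)) / temperature for d in doc_vecs]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
matching_docs = [[1.0, 0.0], [0.0, 1.0]]  # doc i aligned with query i
swapped_docs = [[0.0, 1.0], [1.0, 0.0]]   # doc i aligned with the wrong query

print(info_nce(queries, matching_docs) < info_nce(queries, swapped_docs))  # True
```

The loss is low when each query is closest to its own paired text and high otherwise. Hard-example mining, mentioned for the second stage, amounts to adding negatives that are deliberately difficult to tell apart from the positive.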
The training data is released in its entirety. For more details, see the contrastors repository.
Acknowledgment
We want to thank Zach Nussbaum from Nomic AI for building and sharing the Nomic Embed recipe and tools, and for the support provided during the training of this model!
The training has been run on Orange Business Cloud Avenue infrastructure.
Citation
If you find the model, dataset, or training code useful, please consider citing ModernBERT as well as Nomic Embed:
@misc{modernbert,
title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
author={Benjamin Warner and Antoine Chaffin and Benjamin Clavié and Orion Weller and Oskar Hallström and Said Taghadouini and Alexis Gallagher and Raja Biswas and Faisal Ladhak and Tom Aarsen and Nathan Cooper and Griffin Adams and Jeremy Howard and Iacopo Poli},
year={2024},
eprint={2412.13663},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2412.13663},
}
@misc{nussbaum2024nomic,
title={Nomic Embed: Training a Reproducible Long Context Text Embedder},
author={Zach Nussbaum and John X. Morris and Brandon Duderstadt and Andriy Mulyar},
year={2024},
eprint={2402.01613},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
And if you want to cite this fine-tuning in particular, please use:
@misc{ModernBERT-embed-large,
title={ModernBERT-embed-large},
author={Chaffin, Antoine},
url={https://huggingface.co/lightonai/modernbert-embed-large},
year={2025}
}
Model tree for lightonai/modernbert-embed-large
Base model: answerdotai/ModernBERT-large
Evaluation results (self-reported, test sets)
- accuracy on MTEB AmazonCounterfactualClassification (en): 76.791
- ap on MTEB AmazonCounterfactualClassification (en): 39.796
- f1 on MTEB AmazonCounterfactualClassification (en): 70.696
- accuracy on MTEB AmazonPolarityClassification: 94.195
- ap on MTEB AmazonPolarityClassification: 91.751
- f1 on MTEB AmazonPolarityClassification: 94.192
- accuracy on MTEB AmazonReviewsClassification (en): 47.664
- f1 on MTEB AmazonReviewsClassification (en): 46.933
- map_at_1 on MTEB ArguAna: 25.178
- map_at_10 on MTEB ArguAna: 41.088