---
license: mit
datasets:
- mteb/twentynewsgroups-clustering
- mteb/biorxiv-clustering-s2s
- mteb/biorxiv-clustering-p2p
language:
- en
pipeline_tag: text-classification
library_name: sentence-transformers
tags:
- mteb
- text
- transformers
- text-embeddings-inference
- sparse-encoder
- sparse
- csr
model-index:
- name: NV-Embed-v2
  results:
  - dataset:
      name: MTEB BiorxivClusteringP2P.v2
      type: mteb/biorxiv_clustering_p2p
      revision: f5dbc242e11dd8e24def4c4268607a49e02946dc
      config: default
      split: test
      languages:
      - eng-Latn
    metrics:
    - type: v_measure
      value: 0.579338
    - type: v_measure_std
      value: 0.00337
    - type: main_score
      value: 0.579338
    task:
      type: Clustering
  - dataset:
      name: MTEB BiorxivClusteringS2S.v2
      type: mteb/biorxiv_clustering_s2s
      revision: eb4edb10386758d274cd161093eb351381a16dbf
      config: default
      split: test
      languages:
      - eng-Latn
    metrics:
    - type: v_measure
      value: 0.540989
    - type: v_measure_std
      value: 0.005707
    - type: main_score
      value: 0.540989
    task:
      type: Clustering
  - dataset:
      name: MTEB TwentyNewsgroupsClustering
      type: mteb/twenty_newsgroups_clustering
      revision: 6125ec4e24fa026cec8a478383ee943acfbd5449
      config: default
      split: test
      languages:
      - eng-Latn
    metrics:
    - type: v_measure
      value: 0.630936
    - type: v_measure_std
      value: 0.007942
    - type: main_score
      value: 0.630936
    task:
      type: Clustering
base_model:
- nvidia/NV-Embed-v2
---
For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our GitHub repository.
## Usage
📌 **Tip:** For NV-Embed-v2, Transformers versions later than 4.47.0 may lead to performance degradation, because `model_type=bidir_mistral` in `config.json` is no longer supported.
We recommend using Transformers 4.47.0.
### Sentence Transformers Usage
You can load this model with Sentence Transformers and evaluate it on MTEB with the following snippet:

```python
import mteb
from sentence_transformers import SparseEncoder

# Load the model (trust_remote_code is required for the custom CSR modules)
model = SparseEncoder(
    "CSR-NV_Embed_v2-Clustering-Biorxiv_TwentyNews",
    trust_remote_code=True,
)
# Task-specific instruction prompts used during evaluation
model.prompts = {
    "BiorxivClusteringP2P.v2": "Instruct: Identify the main category of Biorxiv papers based on the titles and abstracts\nQuery:",
    "BiorxivClusteringS2S.v2": "Instruct: Identify the main category of Biorxiv papers based on the titles\nQuery:",
    "TwentyNewsgroupsClustering": "Instruct: Identify the topic or theme of the given news articles\nQuery:",
}

tasks = mteb.get_tasks(tasks=["BiorxivClusteringP2P.v2", "BiorxivClusteringS2S.v2", "TwentyNewsgroupsClustering"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(
    model,
    eval_splits=["test"],
    output_folder="./results/clustering",
    show_progress_bar=True,
    # MTEB doesn't support sparse tensors yet, so we convert to dense tensors
    encode_kwargs={"convert_to_sparse_tensor": False, "batch_size": 8},
)
```
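For intuition about the `convert_to_sparse_tensor` flag: sparse encoders produce high-dimensional embeddings in which most entries are zero, which can be stored either densely or as a torch sparse (COO) tensor. A minimal illustration of the two representations, independent of this model and assuming only that PyTorch is installed:

```python
import torch

# A toy "sparse embedding" stored densely: most entries are zero
dense = torch.tensor([[0.0, 1.5, 0.0, 2.0]])

# Sparse COO representation keeps only the non-zero values and their indices
sparse = dense.to_sparse()
print(sparse._nnz())  # number of stored non-zeros: 2

# Converting back recovers the original dense tensor exactly
restored = sparse.to_dense()
print(torch.equal(dense, restored))  # True
```

With `convert_to_sparse_tensor=False`, the encoder returns the dense form, which is what MTEB's downstream clustering metrics expect.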
## Citation
```bibtex
@inproceedings{wenbeyond,
  title={Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation},
  author={Wen, Tiansheng and Wang, Yifei and Zeng, Zequn and Peng, Zhong and Su, Yudi and Liu, Xinyang and Chen, Bo and Liu, Hongwei and Jegelka, Stefanie and You, Chenyu},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025}
}
```