---
license: mit
datasets:
  - mteb/twentynewsgroups-clustering
  - mteb/biorxiv-clustering-s2s
  - mteb/biorxiv-clustering-p2p
language:
  - en
pipeline_tag: feature-extraction
library_name: sentence-transformers
tags:
  - mteb
  - text
  - transformers
  - text-embeddings-inference
  - sparse-encoder
  - sparse
  - csr
model-index:
  - name: CSR-NV_Embed_v2-Clustering-Biorxiv_TwentyNews
    results:
      - dataset:
          name: MTEB BiorxivClusteringP2P.v2
          type: mteb/biorxiv_clustering_p2p
          revision: f5dbc242e11dd8e24def4c4268607a49e02946dc
          config: default
          split: test
          languages:
            - eng-Latn
        metrics:
          - type: v_measure
            value: 0.579338
          - type: v_measure_std
            value: 0.00337
          - type: main_score
            value: 0.579338
        task:
          type: Clustering
      - dataset:
          name: MTEB BiorxivClusteringS2S.v2
          type: mteb/biorxiv_clustering_s2s
          revision: eb4edb10386758d274cd161093eb351381a16dbf
          config: default
          split: test
          languages:
            - eng-Latn
        metrics:
          - type: v_measure
            value: 0.540989
          - type: v_measure_std
            value: 0.005707
          - type: main_score
            value: 0.540989
        task:
          type: Clustering
      - dataset:
          name: MTEB TwentyNewsgroupsClustering
          type: mteb/twenty_newsgroups_clustering
          revision: 6125ec4e24fa026cec8a478383ee943acfbd5449
          config: default
          split: test
          languages:
            - eng-Latn
        metrics:
          - type: v_measure
            value: 0.630936
          - type: v_measure_std
            value: 0.007942
          - type: main_score
            value: 0.630936
        task:
          type: Clustering
base_model:
  - nvidia/NV-Embed-v2
---

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our GitHub repository.

## Usage

📌 Tip: For NV-Embed-V2, Transformers versions later than 4.47.0 may lead to performance degradation, because the `model_type=bidir_mistral` setting in `config.json` is no longer supported in newer releases.

We recommend using Transformers 4.47.0.
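If you want to guard against a mismatched version at runtime, a check along these lines is one option (a minimal sketch; `is_recommended` and `check_transformers` are illustrative helpers, not part of this repository):

```python
import warnings
from importlib.metadata import PackageNotFoundError, version

RECOMMENDED = "4.47.0"  # Transformers version recommended above for NV-Embed-V2

def is_recommended(installed: str, recommended: str = RECOMMENDED) -> bool:
    """True when the installed version exactly matches the recommended pin."""
    return installed == recommended

def check_transformers() -> bool:
    """Warn and return False unless the recommended Transformers version is installed."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        warnings.warn("transformers is not installed")
        return False
    if not is_recommended(installed):
        warnings.warn(
            f"transformers=={installed} detected; {RECOMMENDED} is recommended, "
            "since model_type=bidir_mistral is unsupported in newer releases."
        )
        return False
    return True
```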

### Sentence Transformers Usage

You can evaluate this model with Sentence Transformers and MTEB using the following code snippet:

```python
import mteb
from sentence_transformers import SparseEncoder

model = SparseEncoder(
    "CSR-NV_Embed_v2-Clustering-Biorxiv_TwentyNews",
    trust_remote_code=True,
)
model.prompts = {
    "BiorxivClusteringP2P.v2": "Instruct: Identify the main category of Biorxiv papers based on the titles and abstracts\nQuery:",
    "BiorxivClusteringS2S.v2": "Instruct: Identify the main category of Biorxiv papers based on the titles\nQuery:",
    "TwentyNewsgroupsClustering": "Instruct: Identify the topic or theme of the given news articles\nQuery:",
}
tasks = mteb.get_tasks(tasks=["BiorxivClusteringP2P.v2", "BiorxivClusteringS2S.v2", "TwentyNewsgroupsClustering"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(
    model,
    eval_splits=["test"],
    output_folder="./results/clustering",
    show_progress_bar=True,
    # MTEB does not support sparse tensors yet, so convert to dense tensors.
    encode_kwargs={"convert_to_sparse_tensor": False, "batch_size": 8},
)
```
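The `convert_to_sparse_tensor=False` flag matters because MTEB consumes plain dense arrays. The densification it implies can be illustrated without loading the model (a toy sketch; the index-to-value dictionary below is only for illustration, not the encoder's actual output type):

```python
# Toy sparse embedding: index -> activation; every other dimension is zero,
# as a CSR-style sparse encoder leaves most dimensions inactive.
sparse_embedding = {1: 1.5, 4: 2.0}
dim = 8

def to_dense(sparse: dict, dim: int) -> list:
    """Expand an index->value mapping into a full dense vector of length dim."""
    vec = [0.0] * dim
    for i, v in sparse.items():
        vec[i] = v
    return vec

dense = to_dense(sparse_embedding, dim)
print(dense)  # [0.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]
```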

## Citation

```bibtex
@inproceedings{wenbeyond,
  title={Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation},
  author={Wen, Tiansheng and Wang, Yifei and Zeng, Zequn and Peng, Zhong and Su, Yudi and Liu, Xinyang and Chen, Bo and Liu, Hongwei and Jegelka, Stefanie and You, Chenyu},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025}
}
```