A curated subset of the datasets that work out of the box with Sentence Transformers: https://huggingface.co/datasets?other=sentence-transformers
AI & ML interests
The models listed here are tuned for generating sentence and text embeddings. They can be used with the sentence-transformers package.
These datasets contain MS MARCO Triplets gathered by mining hard negatives using various models. Each dataset has various subsets.
- sentence-transformers/msmarco-scores-ms-marco-MiniLM-L6-v2
- sentence-transformers/msmarco-msmarco-distilbert-base-tas-b
- sentence-transformers/msmarco-msmarco-distilbert-base-v3
- sentence-transformers/msmarco-msmarco-MiniLM-L6-v3
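To make "mining hard negatives" concrete, here is a minimal sketch of the general idea, not the exact procedure used to build these datasets. It assumes a retrieval model has already produced embeddings; the function name `mine_hard_negatives` and the toy 2-dimensional vectors are hypothetical and only for illustration.

```python
import numpy as np

def mine_hard_negatives(query_emb, passage_embs, positive_idx, num_negatives=2):
    """Return indices of the highest-scoring passages that are not the positive."""
    # Cosine similarity between the query and every passage.
    q = query_emb / np.linalg.norm(query_emb)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    scores = p @ q
    # Rank passages from most to least similar, then drop the known positive.
    ranked = np.argsort(-scores)
    return [int(i) for i in ranked if i != positive_idx][:num_negatives]

# Toy embeddings standing in for a real model's output (assumption).
query = np.array([1.0, 0.0])
passages = np.array([
    [0.9, 0.1],  # 0: the labelled positive
    [0.8, 0.3],  # 1: similar to the query but not the positive
    [0.0, 1.0],  # 2: unrelated passage, an easy negative
    [0.7, 0.6],  # 3: moderately similar passage
])

hard_negatives = mine_hard_negatives(query, passages, positive_idx=0)
# Each (query, positive, negative) index tuple is one training triplet.
triplets = [(0, 0, neg) for neg in hard_negatives]
```

Passages that the model scores highly but that are not the labelled positive are harder to tell apart from the positive than random passages, which is why they are commonly chosen as negatives.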
Each of these parallel-sentence datasets has "english" and "non_english" columns. They can be used to make embedding models multilingual.
- sentence-transformers/parallel-sentences-wikititles
- sentence-transformers/parallel-sentences-tatoeba
- sentence-transformers/parallel-sentences-talks
- sentence-transformers/parallel-sentences-europarl
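One common way to use such parallel pairs is teacher-student distillation: a student model is trained so that its embeddings of both the English sentence and its translation match a teacher model's embedding of the English sentence. The sketch below only illustrates that objective with toy vectors; the page does not state which training method is intended, and `distillation_loss` and the example row are hypothetical.

```python
import numpy as np

def distillation_loss(teacher_en, student_en, student_non_en):
    """Sum of two mean squared errors, both measured against the teacher's English embedding."""
    loss_en = np.mean((student_en - teacher_en) ** 2)
    loss_non_en = np.mean((student_non_en - teacher_en) ** 2)
    return float(loss_en + loss_non_en)

# One row in the two-column layout described above (sentence text is made up).
row = {"english": "Hello world", "non_english": "Hallo Welt"}

# Toy embeddings standing in for real model outputs (assumption).
teacher_en = np.array([1.0, 0.0])      # teacher embedding of row["english"]
student_en = np.array([1.0, 0.0])      # student already agrees on the English side
student_non_en = np.array([0.0, 1.0])  # student places the translation elsewhere

loss = distillation_loss(teacher_en, student_en, student_non_en)
aligned_loss = distillation_loss(teacher_en, student_en, teacher_en)
```

The loss reaches zero only when the translation is mapped to the same vector as the English sentence, which is what lets an English-trained embedding space carry over to other languages.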
NanoBEIR by Zeta Alpha, extended with BM25 scores. These datasets are used in the Sentence Transformers Cross Encoder NanoBEIR Evaluator.
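As a rough illustration of why BM25 scores are useful for evaluating a cross encoder, the sketch below reranks the top BM25 candidates for one query with a scoring function and compares the reciprocal rank before and after. This is an assumption about the general rerank-then-measure pattern, not the evaluator's actual implementation; `rerank`, `reciprocal_rank`, and the toy scores are hypothetical.

```python
def rerank(bm25_ranking, score_fn, top_k):
    """Reorder the first top_k BM25 candidates by score_fn, highest score first."""
    candidates = bm25_ranking[:top_k]
    return sorted(candidates, key=score_fn, reverse=True)

def reciprocal_rank(ranking, relevant_id):
    """Return 1 / rank of the relevant document, or 0.0 if it is absent."""
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

# Document ids ordered by BM25 score for a single query (toy data).
bm25_ranking = ["d3", "d1", "d2", "d4"]
# Stand-in for cross-encoder relevance scores (assumption).
toy_scores = {"d1": 0.2, "d2": 0.9, "d3": 0.5, "d4": 0.1}

reranked = rerank(bm25_ranking, toy_scores.get, top_k=3)
before = reciprocal_rank(bm25_ranking, "d2")
after = reciprocal_rank(reranked, "d2")
```

Here the relevant document `d2` moves from third place under BM25 to first place after reranking, so the reciprocal rank rises from 1/3 to 1.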