Retrieval benchmark datasets default
Hello,
I would like to ask why the following datasets were chosen as the default for computing the Retrieval score
(with only Retrieval selected as the task type):
Retrieval
- StackOverflowQA, p2p
- TwitterHjerneRetrieval, p2p
- AILAStatutes, p2p
- ArguAna, s2p
- LegalBenchCorporateLobbying, s2p
- LEMBPasskeyRetrieval, s2p
- SCIDOCS, s2p
- SpartQA, s2s
- TempReasonL1, s2s
- TRECCOVID, s2p
- WinoGrande, s2s
- BelebeleRetrieval, s2p, multilingual 376 / 376 Subsets
- MIRACLRetrievalHardNegatives, s2p, multilingual 18 / 18 Subsets
- MLQARetrieval, s2p, multilingual 49 / 49 Subsets
- StatcanDialogueDatasetRetrieval, s2p, multilingual 2 / 2 Subsets
- WikipediaRetrievalMultilingual, s2p, multilingual 16 / 16 Subsets
- CovidRetrieval, s2p
Hello, could somebody please answer the question? We still can't wrap our heads around the defaults. Is this really the basis for the numbers displayed in the leaderboard for IR?
Hello!
Yes, those are the individual tasks used for the Retrieval task type in the Multilingual benchmark. See also the paper: https://arxiv.org/abs/2502.13595
You'll get different tasks if you select different benchmarks.
- Tom Aarsen
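In case it helps, here is a rough sketch of how the task list behind a benchmark can be inspected programmatically with the `mteb` package. The exact benchmark name ("MTEB(Multilingual, v1)") and the metadata attribute names are assumptions based on recent versions of the library, so please double-check them against the mteb docs:

```python
# Minimal sketch (not an official snippet): list the Retrieval tasks of a benchmark.
# The benchmark name "MTEB(Multilingual, v1)" is an assumption -- verify it via
# mteb.get_benchmarks() or the documentation.
import mteb

benchmark = mteb.get_benchmark("MTEB(Multilingual, v1)")

# Keep only tasks whose task type is "Retrieval"
retrieval_tasks = [
    task for task in benchmark.tasks
    if task.metadata.type == "Retrieval"
]

# Print the task name and its category (e.g. s2p, p2p, s2s)
for task in retrieval_tasks:
    print(task.metadata.name, task.metadata.category)
```

Running this should print the same task/category pairs as the list in the first post, which is what the leaderboard averages over for the Retrieval column of that benchmark.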
Hi @koleckar,
We also outline the task selection procedure in Section 2.4 of the paper.
Sorry for missing this earlier; in general, though, we refer questions to the GitHub repository - you will get faster answers over there :)