Retrieval benchmark datasets default
Hello,
I would like to ask why the following datasets were chosen as the default for computing the Retrieval score
(with only Retrieval selected as the task type):
Retrieval
- StackOverflowQA, p2p
- TwitterHjerneRetrieval, p2p
- AILAStatutes, p2p
- ArguAna, s2p
- LegalBenchCorporateLobbying, s2p
- LEMBPasskeyRetrieval, s2p
- SCIDOCS, s2p
- SpartQA, s2s
- TempReasonL1, s2s
- TRECCOVID, s2p
- WinoGrande, s2s
- BelebeleRetrieval, s2p, multilingual 376 / 376 Subsets
- MIRACLRetrievalHardNegatives, s2p, multilingual 18 / 18 Subsets
- MLQARetrieval, s2p, multilingual 49 / 49 Subsets
- StatcanDialogueDatasetRetrieval, s2p, multilingual 2 / 2 Subsets
- WikipediaRetrievalMultilingual, s2p, multilingual 16 / 16 Subsets
- CovidRetrieval, s2p
Hello, could somebody please answer the question? We still can't wrap our heads around the defaults. Is this really the basis for the numbers displayed in the leaderboard for IR?
Hello!
Yes, those are the individual tasks used for the Retrieval task type in the Multilingual benchmark. See also the paper: https://arxiv.org/abs/2502.13595
You'll get different tasks if you select different benchmarks.
- Tom Aarsen
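In case it helps, here is a rough sketch of how the task list behind a benchmark can be inspected programmatically with the `mteb` package. The exact benchmark name ("MTEB(Multilingual, v1)") and the metadata attribute names are assumptions based on recent versions of the library, so please double-check them against the mteb docs:

```python
# Minimal sketch (not an official snippet): list the Retrieval tasks of a benchmark.
# The benchmark name "MTEB(Multilingual, v1)" is an assumption -- verify it via
# mteb.get_benchmarks() or the documentation.
import mteb

benchmark = mteb.get_benchmark("MTEB(Multilingual, v1)")

# Keep only tasks whose task type is "Retrieval"
retrieval_tasks = [
    task for task in benchmark.tasks
    if task.metadata.type == "Retrieval"
]

# Print the task name and its category (e.g. s2p, p2p, s2s)
for task in retrieval_tasks:
    print(task.metadata.name, task.metadata.category)
```

Running this should print the same task/category pairs as the list in the first post, which is what the leaderboard averages over for the Retrieval column of that benchmark.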
Hi @koleckar,
We also outline the task selection procedure in Section 2.4 of the paper.
Sorry for missing this earlier; in general, though, we refer questions to the GitHub repository - you will get faster answers over there :)