Nandan Thakur

nthakur

AI & ML interests

NLP, IR, QA

Recent Activity

liked a model about 17 hours ago
Qwen/Qwen2.5-3B
updated a dataset about 19 hours ago
nthakur/bge-full-data-nv-embed
published a dataset about 19 hours ago
nthakur/bge-full-data-nv-embed

Organizations

Castorini, BEIR, INCOME, Poison Texts, Databricks, MIRACL, Vectara

Posts 1

🦢 The SWIM-IR dataset contains 29 million text-retrieval training pairs across 27 diverse languages. It is one of the largest synthetic multilingual datasets generated using PaLM 2 on Wikipedia! 🔥🔥

The SWIM-IR dataset contains three subsets:
- Cross-lingual: nthakur/swim-ir-cross-lingual
- Monolingual: nthakur/swim-ir-monolingual
- Indic Cross-lingual: nthakur/indic-swim-ir-cross-lingual
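
For anyone who wants to pull a subset programmatically, here is a minimal sketch. The repository IDs are copied from the list above; the dictionary keys and both helper functions are hypothetical names made up for illustration, and the config and split names are not stated in this post, so check each dataset card before calling `stream_subset`. It assumes the Hugging Face `datasets` library is installed.

```python
# Repository IDs copied from the post; the dictionary keys are illustrative labels.
SWIM_IR_SUBSETS = {
    "cross-lingual": "nthakur/swim-ir-cross-lingual",
    "monolingual": "nthakur/swim-ir-monolingual",
    "indic-cross-lingual": "nthakur/indic-swim-ir-cross-lingual",
}


def dataset_url(subset: str) -> str:
    """Return the Hub page URL for a subset (hypothetical helper)."""
    return f"https://huggingface.co/datasets/{SWIM_IR_SUBSETS[subset]}"


def stream_subset(subset: str, config: str, split: str = "train"):
    """Stream one subset from the Hub without downloading it in full (hypothetical helper).

    Assumption: `config` and `split` must match names listed on the dataset card;
    they are not given in the post.
    """
    # Imported lazily so the URL helper works even without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(SWIM_IR_SUBSETS[subset], config, split=split, streaming=True)
```

Streaming is used here only because the post describes 29 million pairs; a regular (non-streaming) load should work too if disk space allows.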

Check it out:
https://huggingface.co/collections/nthakur/swim-ir-dataset-662ddaecfc20896bf14dd9b7