Nandan Thakur

nthakur

https://thakur-nandan.github.io

AI & ML interests

NLP, IR, QA

Recent Activity

updated a dataset 16 days ago

nthakur/odyssey-20K

published a dataset 16 days ago

nthakur/odyssey-20K

updated a dataset 21 days ago

nthakur/odyssey-verified-27K-oracled-round-2

Posts 2

Post

1832

Last year, I curated & generated a few multilingual SFT and DPO datasets by translating English SFT/DPO datasets into 9-10 languages using the mistralai/Mistral-7B-Instruct-v0.2 model.

I hope they help the community pretrain or instruction-tune multilingual LLMs! I added a small diagram that briefly describes which datasets are included and their sources.

Happy to collaborate with anyone who wants to use these datasets for instruction fine-tuning or to extend the translations to newer English SFT/DPO datasets!

nthakur/multilingual-sft-and-dpo-datasets-67eaf56fe3feca5a57cf7d74

Post

3750

🦢 The SWIM-IR dataset contains 29 million text-retrieval training pairs across 27 diverse languages. It is one of the largest synthetic multilingual datasets generated using PaLM 2 on Wikipedia! 🔥🔥

The SWIM-IR dataset contains three subsets:
- Cross-lingual: nthakur/swim-ir-cross-lingual
- Monolingual: nthakur/swim-ir-monolingual
- Indic Cross-lingual: nthakur/indic-swim-ir-cross-lingual

Check it out:
https://huggingface.co/collections/nthakur/swim-ir-dataset-662ddaecfc20896bf14dd9b7

Collections 5

Papers 16

arxiv:2508.06600

arxiv:2505.16967

arxiv:2504.20006

arxiv:2504.13128

Models 43

nthakur/qwen3-4b-grpo-modified-10-docs-odyssey-27k-step-60

4B • Updated Oct 2, 2025 • 7

nthakur/qwen3-4b-grpo-round-2-modified-10-docs-step-160

4B • Updated Sep 25, 2025 • 9

nthakur/qwen3-4b-grpo-mix-1-1-1-step-165

4B • Updated Sep 16, 2025 • 174

nthakur/qwen3-4b-grpo-infoseek-mix-1-1-1-step-25

4B • Updated Sep 15, 2025 • 7

nthakur/qwen3-4b-grpo-mix-1-2-4-step-225

4B • Updated Sep 10, 2025 • 8

nthakur/qwen3-4b-grpo-10-docs-modified-mix-1-1-1-step-385

4B • Updated Sep 8, 2025 • 6

nthakur/qwen3-4b-grpo-only-odyssey-step-210

4B • Updated Aug 27, 2025 • 7

nthakur/baseline-qwen3-4b-grpo-nq-hotpotqa-step-200

4B • Updated Aug 27, 2025 • 6

nthakur/baseline-qwen3-4b-ppo-nq-hotpotqa-step-200

4B • Updated Aug 20, 2025 • 4

nthakur/Mistral-7B-Instruct-v0.2-mirage-bench-sft-teacher-mixtral

Updated Mar 31, 2025 • 11 • 1

Datasets 64

nthakur/odyssey-20K

Viewer • Updated 16 days ago • 20.1k • 57

nthakur/odyssey-verified-27K-oracled-round-2

Viewer • Updated 21 days ago • 12.3k • 41

nthakur/odyssey-verified-hard-17K

Viewer • Updated Sep 16, 2025 • 17.5k • 5

nthakur/odyssey-verified-27K

Viewer • Updated Sep 13, 2025 • 27.1k • 35

nthakur/search-arena-v1-nuggets-with-urls-5k-qwen

Viewer • Updated Jul 29, 2025 • 5.1k • 9

nthakur/auto-browsecomp-18k

Viewer • Updated Jun 23, 2025 • 18k • 5

nthakur/auto-browsecomp-10k

Viewer • Updated Jun 17, 2025 • 9.88k • 5

nthakur/cornstack-6-langs-v1-tevatron-6M

Viewer • Updated Jun 3, 2025 • 5.92M • 300

nthakur/cornstack-php-v1-tevatron-1M

Viewer • Updated Jun 2, 2025 • 993k • 73

nthakur/cornstack-go-v1-tevatron-1M

Viewer • Updated May 30, 2025 • 995k • 77
