Nandan Thakur

nthakur

https://thakur-nandan.github.io

AI & ML interests

NLP, IR, QA

Recent Activity

liked a model about 1 month ago

deepseek-ai/DeepSeek-V4-Pro

upvoted an article about 1 month ago

DenseOn with the LateOn: Open State-of-the-Art Single and Multi-Vector Models

updated a dataset about 1 month ago

orbit-ai/orbit-seeds

Posts 2

Post

1916

Last year, I curated & generated a few multilingual SFT and DPO datasets by translating English SFT/DPO datasets into 9-10 languages using the mistralai/Mistral-7B-Instruct-v0.2 model.

I hope these datasets help the community with pretraining and instruction-tuning multilingual LLMs! I added a small diagram that briefly describes which datasets are included and their sources.

Happy to collaborate with anyone who is using these datasets for instruction fine-tuning or who wishes to extend the translated versions to newer English SFT/DPO datasets!

nthakur/multilingual-sft-and-dpo-datasets-67eaf56fe3feca5a57cf7d74

Post

3838

🦢 The SWIM-IR dataset contains 29 million text-retrieval training pairs across 27 diverse languages. It is one of the largest synthetic multilingual datasets generated using PaLM 2 on Wikipedia! 🔥🔥

The SWIM-IR dataset contains three subsets:
- Cross-lingual: nthakur/swim-ir-cross-lingual
- Monolingual: nthakur/swim-ir-monolingual
- Indic Cross-lingual: nthakur/indic-swim-ir-cross-lingual

Check it out:
https://huggingface.co/collections/nthakur/swim-ir-dataset-662ddaecfc20896bf14dd9b7

Collections 5

Papers 18

arxiv:2604.01195

arxiv:2508.06600

arxiv:2505.16967

arxiv:2504.20006

Models 36

nthakur/orbit-4b-asearcher-en-no-math-14K-step-75

4B • Updated Apr 20 • 6

nthakur/qwen3-4b-grpo-modified-5-docs-only-odyssey-step-135

4B • Updated Apr 20 • 6

nthakur/Mistral-7B-Instruct-v0.2-mirage-bench-sft-teacher-mixtral

Updated Mar 31, 2025 • 8 • 1

nthakur/Meta-Llama-3-8B-Instruct-mirage-bench-sft

Updated Mar 31, 2025 • 3

nthakur/Mistral-7B-Instruct-v0.2-mirage-bench-sft

Updated Mar 31, 2025 • 6

nthakur/Mistral-7B-Instruct-v0.2-multilingual-dpo-v1.0-v2

Updated Aug 23, 2024 • 3

nthakur/Mistral-7B-Instruct-v0.2-multilingual-dpo-v1.0-final

Updated Aug 13, 2024

nthakur/Meta-Llama-3-8B-Instruct-mirage-all-teacher-instruct-llama-3-sft

Updated Aug 13, 2024 • 2

nthakur/Mistral-7B-Instruct-v0.2-mirage-all-teacher-instruct-mistral-sft

Updated Aug 13, 2024 • 3

nthakur/Mistral-7B-Instruct-v0.2-multilingual-dpo-v1.0

Updated Aug 12, 2024

Datasets 58

nthakur/mirage-bench-pairwise-judgments

Viewer • Updated Mar 19 • 299k • 752 • 1

nthakur/search-arena-v1-nuggets-with-urls-5k-qwen

Viewer • Updated Jul 29, 2025 • 5.1k • 8

nthakur/cornstack-6-langs-v1-tevatron-6M

Viewer • Updated Jun 3, 2025 • 5.92M • 114

nthakur/cornstack-php-v1-tevatron-1M

Viewer • Updated Jun 2, 2025 • 993k • 394

nthakur/cornstack-go-v1-tevatron-1M

Viewer • Updated May 30, 2025 • 995k • 189

nthakur/cornstack-javascript-v1-tevatron-1M

Viewer • Updated May 30, 2025 • 952k • 188

nthakur/cornstack-ruby-v1-tevatron-1M

Viewer • Updated May 30, 2025 • 989k • 139

nthakur/cornstack-java-v1-tevatron-1M

Viewer • Updated May 30, 2025 • 995k • 113

nthakur/cornstack-python-v1-tevatron-1M

Viewer • Updated May 29, 2025 • 994k • 199

nthakur/default-100K-test

Viewer • Updated May 26, 2025 • 19k • 13
