Nandan Thakur

nthakur

AI & ML interests

NLP, IR, QA

Recent Activity

updated a collection 5 days ago
🏜️MIRAGE-Bench [NAACL'25]
View all activity

Organizations

Castorini · BEIR · INCOME · Poison Texts · Databricks · MIRACL · Vectara

nthakur's activity

reacted to clem's post with 🔥 4 days ago
Before 2020, most of the AI field was open and collaborative. For me, that was the key factor that accelerated scientific progress and made the impossible possible—just look at the “T” in ChatGPT, which comes from the Transformer architecture openly shared by Google.

Then came the myth that AI was too dangerous to share, and companies started optimizing for short-term revenue. That led many major AI labs and researchers to stop sharing and collaborating.

With OAI and sama now saying they're willing to share open weights again, we have a real chance to return to a golden age of AI progress and democratization—powered by openness and collaboration, in the US and around the world.

This is incredibly exciting. Let’s go, open science and open-source AI!
reacted to their post with 🔥 5 days ago
Last year, I curated & generated a few multilingual SFT and DPO datasets by translating English SFT/DPO datasets into 9-10 languages using the mistralai/Mistral-7B-Instruct-v0.2 model.

I hope it helps the community with pretraining and instruction-tuning multilingual LLMs! I added a small diagram briefly describing which datasets were added and their sources.

Happy to collaborate, whether on using these datasets for instruction fine-tuning or on extending the translations to newer English SFT/DPO datasets!

nthakur/multilingual-sft-and-dpo-datasets-67eaf56fe3feca5a57cf7d74
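The translation setup described in the post (translating English SFT/DPO records into other languages with an instruct model) could be sketched roughly as below. This is a minimal illustration, not the actual pipeline: the prompt wording, the target-language subset, and the record fields are all assumptions.

```python
# Hedged sketch: building translation prompts for an instruct model
# such as mistralai/Mistral-7B-Instruct-v0.2. The prompt template and
# language list here are illustrative assumptions, not the exact
# configuration used for the published datasets.

TARGET_LANGUAGES = ["German", "French", "Hindi", "Chinese"]  # example subset of the 9-10

def build_translation_prompt(text: str, language: str) -> str:
    """Wrap an English record field in a translation instruction,
    using Mistral's [INST] ... [/INST] chat format."""
    return (
        f"[INST] Translate the following text into {language}. "
        f"Return only the translation.\n\n{text} [/INST]"
    )

def prompts_for_record(record: dict) -> list[tuple[str, str, str]]:
    """Produce one (field, language, prompt) triple per translatable
    field, e.g. 'prompt', 'chosen', 'rejected' in a DPO-style record."""
    out = []
    for field, text in record.items():
        for lang in TARGET_LANGUAGES:
            out.append((field, lang, build_translation_prompt(text, lang)))
    return out

# Hypothetical DPO-style record:
record = {
    "prompt": "What causes tides?",
    "chosen": "Tides are caused mainly by the Moon's gravity.",
    "rejected": "Tides are random.",
}
triples = prompts_for_record(record)
```

Each prompt would then be sent through the model's generation step to produce the translated field; collecting the outputs per language yields one translated SFT/DPO dataset per target language.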