Nick Doiron

monsoon-nlp

AI & ML interests

biology and multilingual models

Recent Activity

upvoted a collection 7 days ago
Llama 4

Organizations

BigScience Workshop, Spaces-explorers, BigCode, Blog-explorers, Scary Snake, Hugging Face Discord Community

monsoon-nlp's activity

reacted to merterbak's post with 🔥 7 days ago
Meta has unveiled its Llama 4 🦙 family of models, featuring native multimodality and a mixture-of-experts architecture. Two models are available now:
Models🤗: meta-llama/llama-4-67f0c30d9fe03840bc9d0164
Blog Post: https://ai.meta.com/blog/llama-4-multimodal-intelligence/
HF's Blog Post: https://huggingface.co/blog/llama4-release

- 🧠 Native Multimodality - Process text and images in a unified architecture
- 🔍 Mixture-of-Experts - First Llama models using MoE for incredible efficiency
- 📏 Super Long Context - Up to 10M tokens
- 🌐 Multilingual Power - Trained on 200 languages with 10x more multilingual tokens than Llama 3 (including over 100 languages with over 1 billion tokens each)

🔹 Llama 4 Scout
- 17B active parameters (109B total)
- 16 experts architecture
- 10M context window
- Fits on a single H100 GPU
- Beats Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1

🔹 Llama 4 Maverick
- 17B active parameters (400B total)
- 128 experts architecture
- Fits on a single DGX H100 (8x H100)
- 1M context window
- Outperforms GPT-4o and Gemini 2.0 Flash
- Elo score of 1417 on LMArena, currently the second-best model on the arena

🔹 Llama 4 Behemoth (Coming Soon)
- 288B active parameters (2T total)
- 16 experts architecture
- Teacher model for Scout and Maverick
- Outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM benchmarks
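
A rough sketch of the arithmetic behind the figures above: in a mixture-of-experts model only part of the weights is used for each token, but all of them still have to sit in memory. The parameter counts come from the post; the bytes-per-parameter values (4-bit weights for Scout, 8-bit for Maverick) are assumptions, and the estimate ignores KV cache and activations.

```python
# Parameter counts in billions, taken from the post above.
MODELS = {
    "Llama 4 Scout": {"active_b": 17, "total_b": 109},
    "Llama 4 Maverick": {"active_b": 17, "total_b": 400},
    "Llama 4 Behemoth": {"active_b": 288, "total_b": 2000},
}

def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the weights used for any single token."""
    return active_b / total_b

def weight_memory_gb(total_b: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Ignores KV cache, activations, and framework overhead.
    """
    return total_b * bytes_per_param

for name, p in MODELS.items():
    frac = active_fraction(p["active_b"], p["total_b"])
    print(f"{name}: {frac:.1%} of weights active per token")

# Assumed precisions, not stated in the post: 4-bit for Scout, 8-bit for Maverick.
print(weight_memory_gb(109, 0.5))   # Scout weights in GB
print(weight_memory_gb(400, 1.0))   # Maverick weights in GB
```

Under these assumptions Scout's weights come to about 54.5 GB, below the 80 GB of a single H100, and Maverick's to about 400 GB, below the 640 GB of an 8x H100 node.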
posted an update 12 days ago
reacted to daavoo's post with 👀 28 days ago
reacted to clem's post with 🚀 30 days ago
We just crossed 1,500,000 public models on Hugging Face (and 500k spaces, 330k datasets, 50k papers). One new repository is created every 15 seconds. Congratulations all!
reacted to Yehor's post with 👍 about 1 month ago
replied to ashercn97's post about 1 month ago

I would say, sort by "Mean (task)" and pick one of the top models. Or, if you can, compare three of the best on your own data. That holds unless you need a longer context, or you work in medicine or a similar field where domain-specific models exist.
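
A minimal sketch of that advice, assuming the leaderboard has been exported as a list of rows. Only the "Mean (task)" column name comes from the reply; the model names, scores, and the `max_tokens` field are hypothetical placeholders.

```python
# Hypothetical leaderboard rows; names and scores are placeholders, not real results.
leaderboard = [
    {"model": "model-a", "Mean (task)": 61.2, "max_tokens": 512},
    {"model": "model-b", "Mean (task)": 64.8, "max_tokens": 8192},
    {"model": "model-c", "Mean (task)": 63.5, "max_tokens": 512},
    {"model": "model-d", "Mean (task)": 59.9, "max_tokens": 32768},
]

def shortlist(rows, top_k=3, min_context=0):
    """Return the top_k model names by "Mean (task)" that meet a context-length floor."""
    eligible = [r for r in rows if r["max_tokens"] >= min_context]
    ranked = sorted(eligible, key=lambda r: r["Mean (task)"], reverse=True)
    return [r["model"] for r in ranked[:top_k]]

# Three candidates to compare on your own data.
print(shortlist(leaderboard))
# Same ranking, restricted to models with a longer context window.
print(shortlist(leaderboard, min_context=8192))
```

The shortlist is only a starting point; the comparison on your own data decides which model to keep.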

posted an update about 1 month ago
Genetic counselors help patients get 🧬 tests and understand their results. They need to study inheritance of several conditions, statistics, and patient care 🤓⚕️. I compiled 225 multiple-choice questions for the ABGC exam into a dataset: monsoon-nlp/genetic-counselor-multiple-choice
Llama 3.1 8B Instruct gets a 51% score.
I'm also creating a dataset of real-world open-ended questions (starting with Reddit) and am open to contributors
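
A score like the one above can be computed with a few lines of code. This is only a sketch: it assumes four answer options labelled A to D and free-text model replies, neither of which is confirmed by the post, and the sample replies and answer key are invented for illustration.

```python
import re
from typing import List, Optional

def extract_choice(model_output: str) -> Optional[str]:
    """Return the first standalone capital letter A-D in a reply, or None."""
    match = re.search(r"\b([A-D])\b", model_output)
    return match.group(1) if match else None

def accuracy(outputs: List[str], answer_key: List[str]) -> float:
    """Fraction of replies whose extracted letter matches the answer key."""
    correct = sum(extract_choice(o) == a for o, a in zip(outputs, answer_key))
    return correct / len(answer_key)

# Invented replies and key, for illustration only.
outputs = ["B", "The answer is C.", "A) autosomal dominant", "I am not sure"]
answer_key = ["B", "C", "D", "A"]
print(accuracy(outputs, answer_key))
```

Replies with no recognizable letter count as wrong, which keeps the score conservative.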
reacted to MohamedRashad's post with 🧠 2 months ago
reacted to davanstrien's post with ❤️ 4 months ago
🇸🇰 Hovorte po slovensky? Help build better AI for Slovak!

We only need 90 more annotations to include Slovak in the next Hugging Face FineWeb2-C dataset (data-is-better-together/fineweb-c) release!

Your contribution will help create better language models for 5+ million Slovak speakers.

Annotate here: data-is-better-together/fineweb-c.

Read more about why we're doing it: https://huggingface.co/blog/davanstrien/fineweb2-community
reacted to MohamedRashad's post with 🚀 4 months ago
reacted to fdaudens's post with 🤗 5 months ago
🦋 Hug the butterfly! You can now add your Bluesky handle to your Hugging Face profile! ✨
reacted to m-ric's post with 😎 5 months ago
I'm very proud to have supported @CGIAR and @Digigreen in making http://Farmer.chat, an app that supports 20k smallholder farmers on a daily basis 🌾

There are ~500 million smallholder farmers globally, playing a critical role in global food security. Having access to accurate information is essential for them.

💬 An “agricultural extension service” offers technical advice on agriculture, and also supplies farmers with the necessary inputs and services to support their agricultural production.

But there are not enough agricultural extension agents to handle all the requests, especially in countries like Kenya, India, Ethiopia, and Nigeria.

🚀 So the team set out to build an app called http://Farmer.Chat, to provide an agricultural extension service, by building on the immense knowledge accumulated by CGIAR.

✨ The app is technically impressive: behind the Whatsapp-type UX, an agent interprets the user's intent, and identifies which tool to call to best answer their request: weather API, RAG on a CGIAR-provided knowledge base, market data, etc. The RAG on the knowledge base is in itself a work of art.
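
The routing step described above can be sketched roughly as follows. This is not the app's code: the keyword lists stand in for the agent's intent detection, and the three tool functions are hypothetical stubs that only echo what a real weather API, market-data source, or RAG pipeline would be asked.

```python
import re
from typing import Callable, Dict

# Stub tools (hypothetical): each one only reports what it would look up.
def weather_tool(query: str) -> str:
    return f"[weather] forecast lookup for: {query}"

def market_tool(query: str) -> str:
    return f"[market] price lookup for: {query}"

def knowledge_base_tool(query: str) -> str:
    return f"[rag] knowledge-base search for: {query}"

TOOLS: Dict[str, Callable[[str], str]] = {
    "weather": weather_tool,
    "market": market_tool,
    "knowledge_base": knowledge_base_tool,
}

# Hypothetical keyword lists standing in for the agent's intent model.
INTENT_KEYWORDS = {
    "weather": {"rain", "weather", "forecast", "temperature"},
    "market": {"price", "market", "sell", "cost"},
}

def detect_intent(query: str) -> str:
    """Pick a tool name; fall back to the knowledge base when nothing matches."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    for intent, keywords in INTENT_KEYWORDS.items():
        if words & keywords:
            return intent
    return "knowledge_base"

def answer(query: str) -> str:
    """Route the query to the tool chosen by detect_intent."""
    return TOOLS[detect_intent(query)](query)

print(answer("Will it rain tomorrow?"))
print(answer("What is the price of maize?"))
print(answer("How do I treat leaf rust?"))
```

In a real agent the `detect_intent` step would be a model call rather than keyword matching; the dispatch table is the part that carries over.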

🎯 A key part of building such a complex system is to be able to evaluate it properly. During our bi-weekly sessions with the team, I could support them in implementing the method called "LLM-as-a-judge" to tackle this problem.
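
As a rough illustration of the LLM-as-a-judge idea, the sketch below asks a judge model to grade each answer against a reference and averages the grades. The rubric wording, the 1 to 5 scale, the sample question, and the `fake_judge` stand-in are all assumptions for illustration; the team's actual setup is described in their blog post and may differ.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical rubric: the 1-5 scale and wording are assumptions.
JUDGE_PROMPT = (
    "You are grading an answer to a farmer's question.\n"
    "Question: {question}\n"
    "Reference: {reference}\n"
    "Answer: {answer}\n"
    "Reply with a single integer score from 1 (wrong) to 5 (fully correct)."
)

def parse_score(judge_reply: str) -> Optional[int]:
    """Extract the first digit from 1 to 5 in the judge's reply, if any."""
    match = re.search(r"[1-5]", judge_reply)
    return int(match.group()) if match else None

def evaluate(examples: List[Dict[str, str]],
             judge: Callable[[str], str]) -> Optional[float]:
    """Average the judge's scores over every example it could grade."""
    scores = []
    for ex in examples:
        score = parse_score(judge(JUDGE_PROMPT.format(**ex)))
        if score is not None:
            scores.append(score)
    return sum(scores) / len(scores) if scores else None

def fake_judge(prompt: str) -> str:
    """Stand-in for a real LLM call; always returns the same grade."""
    return "Score: 4"

# Invented example, for illustration only.
examples = [{
    "question": "When should I plant maize?",
    "reference": "At the start of the rainy season.",
    "answer": "Plant when the rains begin.",
}]
print(evaluate(examples, fake_judge))
```

Swapping `fake_judge` for a function that calls an actual model is the only change needed to run this against real answers.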

It worked really well: thanks to the amazing work of the team, the app has now successfully answered over 300 thousand requests, in 6 different languages, and it keeps growing!

➡️ @Vinsingh , @rajgreen and I just wrote a blog post to describe how the app works, especially the LLM-as-a-judge system!

Read it here 👉 https://huggingface.co/blog/digital-green-llm-judge
reacted to Tonic's post with 👀 6 months ago
🙋🏻‍♂️ hey there folks ,

really enjoying sharing cool genomics and protein datasets on the hub these days, check out our cool new org: seq-to-pheno

scroll down for the datasets. still figuring out how to optimize for discoverability; i do think on that part the hub will be better than zenodo[dot]org. it would be nice to write a tutorial about that and compare: we already have more downloads than most zenodo datasets from famous researchers!
reacted to nyuuzyou's post with 👀 6 months ago
🎙 Introducing LiveATC Recordings (Partial 2024-08-26) Dataset - nyuuzyou/liveatc

Dataset highlights:

- 21,172 air traffic control audio recordings from LiveATC.net for August 26, 2024
- Multilingual content, primarily in English with potential for other languages
- Each entry includes: audio file, ICAO airport code, facility type, date, and time
- Contains original MP3 files stored in .tar.zst archives, organized by ICAO airport code
- Data covers various airports and ATC facilities worldwide
- Subject to LiveATC.net's Terms of Use for personal, non-commercial use only

The dataset can be used for audio classification, automatic speech recognition, and analysis of air traffic control communications. The inclusion of recordings from multiple airports allows for comparative analysis across different locations and facility types.
reacted to Tonic's post with 👀 6 months ago
🙋🏻‍♂️ Hey there folks ,

🦎 The Salamandra release by @mvillegas and team at @BSC_CNS (BSC-LT) is absolutely impressive so far!

perhaps the largest single training dataset of high-quality text to date: 7.8 trillion tokens in 35 European languages and code.

the best part: the data was correctly licensed, so it's actually future-proof!

the completions model is really creative, and the instruct fine-tuned version is also very good.

now you can use such models for multilingual enterprise applications with further fine-tunes; long response generation and structured outputs (coding) also work.

check out 👇🏻
the collection : BSC-LT/salamandra-66fc171485944df79469043a
the repo : https://github.com/langtech-bsc/salamandra
7B-Instruct demo : Tonic/Salamandra-7B
reacted to clem's post with 🚀 6 months ago
Very few people realize that most of the successful AI startups became successful because they focused on open science and open source for at least their first few years. To name but a few: OpenAI (GPT and GPT-2 were open-source), Runway & Stability (Stable Diffusion), Cohere, Mistral, and of course Hugging Face!

The reasons are not just altruistic: sharing your science and your models pushes you to build AI faster (which is key in a fast-moving domain like AI), attracts the best scientists & engineers, and generates much more visibility, usage, and community contributions than if you were 100% closed-source. The same applies to big tech companies, as we're seeing with Meta and Google!

More startups and companies should release research & open-source AI. It's not just good for the world; it also increases their probability of success!
reacted to pain's post with ❤️ 7 months ago
reacted to ezgikorkmaz's post with 👀 7 months ago
reacted to Tonic's post with 🚀 7 months ago
🙋🏻‍♂️hey there folks ,

✒️InkubaLM has been trained from scratch using 1.9 billion tokens of data for five African languages, along with English and French data, totaling 2.4 billion tokens of data. It is capable of understanding and generating content in five African languages: Swahili, Yoruba, Hausa, isiZulu, and isiXhosa, as well as English and French.

model lelapa/InkubaLM-0.4B
demo Tonic/Inkuba-0.4B
reacted to ucsahin's post with 🔥 8 months ago
🚀 Introducing TraVisionLM: Turkish Visual Language Model - The First of Its Kind! 🇹🇷🖼️

I'm thrilled to share TraVisionLM on Hugging Face! With 875M parameters, this lightweight, efficient model handles Turkish instructions for image inputs. Fully compatible with the Transformers library, it’s easy to load, fine-tune, and use—no external libraries needed!

Developed solo, TraVisionLM is a strong foundation for low-resource language research. While still improving, it's a key step for Turkish-language AI. Your feedback is welcome as I refine the model.

🎉 Explore it now:

- Model: ucsahin/TraVisionLM-base
- Demo: https://huggingface.co/spaces/ucsahin/TraVisionLM-Turkish_Visual_Language_Model
- Object Detection Finetune: ucsahin/TraVisionLM-Object-Detection-ft

Let’s push Turkish visual language processing forward!

---

🚀 Introducing TraVisionLM: the First-of-Its-Kind Turkish Visual Language Model! 🇹🇷🖼️

I have released TraVisionLM on Hugging Face! With 875M parameters, this lightweight and efficient model is designed to handle Turkish instructions about images. It is fully compatible with the Transformers library and very easy to load, train, and use; no external libraries are needed!

TraVisionLM, which I developed on my own, offers a solid foundation for research on low-resource languages. I welcome your feedback as I continue to develop it.

🎉 Explore it now:

- Model: ucsahin/TraVisionLM-base
- Demo: https://huggingface.co/spaces/ucsahin/TraVisionLM-Turkish_Visual_Language_Model
- Object Detection Fine-tune: ucsahin/TraVisionLM-Object-Detection-ft

Let's push the boundaries of Turkish visual language processing together!