Aymeric Roucher

m-ric

http://aymeric-roucher.github.io

AI & ML interests

Leading Agents at Hugging Face 🤗

Recent Activity

posted an update 3 days ago

New king of open VLMs: InternVL3 takes Qwen 2.5's crown! 👑

InternVL has been a wildly successful series of models, and the latest iteration has just taken back the crown thanks to its superior, natively multimodal vision training pipeline.

➡️ Most vision language models (VLMs) these days are built like Frankenstein: take a good text-only Large Language Model (LLM) backbone and stitch a dedicated vision transformer (ViT) on top of it. Training is then sequential 🔢:

1. Freeze the LLM weights and train only the ViT, so that it learns to work with the LLM.
2. Unfreeze everything and train all weights so that both parts work together.

💫 The Shanghai Lab decided to challenge this paradigm with an approach they call "native". For each of their model sizes, they still start from a good LLM (mostly from the Qwen-2.5 series, did I tell you I'm a huge fan of Qwen? ❤️) and stitch the ViT onto it, but they don't freeze anything: they train all weights together on interleaved text and image understanding data in a single pre-training phase 🎨.

They claim this results in more seamless interactions between modalities. And the results back them up: InternVL3 took the crown of top VLM, at nearly all sizes, from its Qwen-2.5 parents. 👑
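To make the difference between the two recipes concrete, here is a minimal, framework-free sketch of which parts are trainable at each stage. Everything in it is hypothetical and only illustrates the description above: the `Stage` class, the stage names, and the function names are assumptions, not taken from the InternVL3 code or paper.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One training stage: which parts of the VLM receive gradient updates."""
    name: str        # hypothetical label, for illustration only
    train_vit: bool  # is the vision transformer updated in this stage?
    train_llm: bool  # is the language model backbone updated in this stage?


def sequential_recipe() -> list[Stage]:
    """The usual two-step recipe: train the ViT against a frozen LLM, then unfreeze."""
    return [
        Stage("1. align ViT to frozen LLM", train_vit=True, train_llm=False),
        Stage("2. train everything", train_vit=True, train_llm=True),
    ]


def native_recipe() -> list[Stage]:
    """The 'native' recipe as described in the post: one phase, nothing frozen."""
    return [Stage("single multimodal pre-training phase", train_vit=True, train_llm=True)]


def frozen_parts(stage: Stage) -> list[str]:
    """Return the names of the components that stay frozen during a stage."""
    parts = (("vit", stage.train_vit), ("llm", stage.train_llm))
    return [name for name, trainable in parts if not trainable]


if __name__ == "__main__":
    for label, recipe in (("sequential", sequential_recipe()), ("native", native_recipe())):
        for stage in recipe:
            print(f"{label}: {stage.name} | frozen: {frozen_parts(stage) or 'nothing'}")
```

In a real training script, the boolean flags would typically translate into enabling or disabling gradients on the corresponding parameter groups; the sketch only captures the schedule, not the data mix or the optimizer.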

liked a model 3 days ago

OpenGVLab/InternVL3-78B

upvoted a paper 3 days ago

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

m-ric's activity

upvoted a paper 3 days ago

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Paper • 2504.10479 • Published 6 days ago • 228

upvoted a collection 6 days ago

🪐 SmolLM

A series of smol LLMs: 135M, 360M and 1.7B. We release base and Instruct models as well as the training corpus and some WebGPU demos • 12 items • Updated 20 days ago • 222

upvoted an article 6 days ago

SmolLM - blazingly fast and remarkably powerful

Article • Jul 16, 2024 • 354

upvoted a paper 10 days ago

SmolVLM: Redefining small and efficient multimodal models

Paper • 2504.05299 • Published 13 days ago • 163

upvoted 2 papers 11 days ago

Improved Visual-Spatial Reasoning via R1-Zero-Like Training

Paper • 2504.00883 • Published 20 days ago • 60

One-Minute Video Generation with Test-Time Training

Paper • 2504.05298 • Published 13 days ago • 94

upvoted 2 articles 13 days ago

Xet is on the Hub

Article • Mar 18 • 47

Welcome Llama 4 Maverick & Scout on Hugging Face!

Article • 16 days ago • 140

upvoted a paper 13 days ago

SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement

Paper • 2504.03561 • Published 16 days ago • 17

upvoted a collection 13 days ago

Llama 4

Llama 4 release • 10 items • Updated 15 days ago • 438

upvoted a paper 14 days ago

JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse

Paper • 2503.16365 • Published Mar 20 • 39

upvoted a paper 20 days ago

UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning

Paper • 2503.21620 • Published 24 days ago • 59

upvoted an article about 1 month ago

LeRobot goes to driving school: World’s largest open-source self-driving dataset

Article • Mar 11 • 76

upvoted 2 papers about 1 month ago

GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents

Paper • 2406.10819 • Published Jun 16, 2024 • 1

GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

Paper • 2406.08451 • Published Jun 12, 2024 • 26