Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy Paper • 2503.19757 • Published 14 days ago • 48
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks Paper • 2503.21696 • Published 12 days ago • 21
DAPO: An Open-Source LLM Reinforcement Learning System at Scale Paper • 2503.14476 • Published 21 days ago • 115
VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search Paper • 2503.10582 • Published 26 days ago • 21
KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding Paper • 2503.02951 • Published Mar 4 • 29
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features Paper • 2502.14786 • Published Feb 20 • 140
LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization Paper • 2502.13922 • Published Feb 19 • 25
Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking Paper • 2502.02339 • Published Feb 4 • 22
Ovis2 Collection Our latest advancement in multi-modal large language models (MLLMs) • 15 items • Updated 14 days ago • 59
Kimi k1.5: Scaling Reinforcement Learning with LLMs Paper • 2501.12599 • Published Jan 22 • 113
VideoLLaMA3 Collection Frontier Multimodal Foundation Models for Video Understanding • 14 items • Updated 28 days ago • 14
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding Paper • 2501.13106 • Published Jan 22 • 91
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding Paper • 2501.12380 • Published Jan 21 • 86
REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models Paper • 2501.03262 • Published Jan 4 • 99
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining Paper • 2501.00958 • Published Jan 1 • 107
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM Paper • 2501.00599 • Published Dec 31, 2024 • 48
PixMo Collection A set of vision-language datasets built by Ai2 and used to train the Molmo family of models. Read more at https://molmo.allenai.org/blog • 10 items • Updated 26 days ago • 68
Inf-CL Collection The corresponding demos/checkpoints/papers/datasets of Inf-CL. • 2 items • Updated 28 days ago • 3