SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation Paper • 2503.09641 • Published 5 days ago • 19
New Trends for Modern Machine Translation with Large Reasoning Models Paper • 2503.10351 • Published 4 days ago • 19
4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models Paper • 2503.10437 • Published 4 days ago • 24
AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM Paper • 2503.04504 • Published 11 days ago • 2
S2S-Arena, Evaluating Speech2Speech Protocols on Instruction Following with Paralinguistic Information Paper • 2503.05085 • Published 11 days ago • 45
Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities Paper • 2503.03983 • Published 12 days ago • 22
Token-Efficient Long Video Understanding for Multimodal LLMs Paper • 2503.04130 • Published 11 days ago • 80
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs Paper • 2503.01743 • Published 14 days ago • 73
TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding Paper • 2502.19400 • Published 19 days ago • 43
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model Paper • 2502.10248 • Published about 1 month ago • 51
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution Paper • 2409.12191 • Published Sep 18, 2024 • 76
Moshi v0.1 Release Collection • MLX, Candle & PyTorch model checkpoints released as part of the Moshi release from Kyutai. Run inference via https://github.com/kyutai-labs/moshi • 13 items • Updated Sep 18, 2024 • 227
Train Custom Models on Hugging Face Spaces with AutoTrain SpaceRunner Article • By abhishek • May 9, 2024 • 16
UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling Paper • 2408.04810 • Published Aug 9, 2024 • 24
VITA: Towards Open-Source Interactive Omni Multimodal LLM Paper • 2408.05211 • Published Aug 9, 2024 • 48