YuE: Scaling Open Foundation Models for Long-Form Music Generation Paper • 2503.08638 • Published 5 days ago • 56
ABC: Achieving Better Control of Multimodal Embeddings using VLMs Paper • 2503.00329 • Published 16 days ago • 18
TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding Paper • 2502.19400 • Published 18 days ago • 43
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale Paper • 2412.05237 • Published Dec 6, 2024 • 47
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation Paper • 2412.00927 • Published Dec 1, 2024 • 26
OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision Paper • 2411.07199 • Published Nov 11, 2024 • 47
VIEScore: Towards Explainable Metrics for Conditional Image Synthesis Evaluation Paper • 2312.14867 • Published Dec 22, 2023 • 1