II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models Paper • 2406.05862 • Published Jun 9 • 4
Muskits-ESPnet: A Comprehensive Toolkit for Singing Voice Synthesis in New Paradigm Paper • 2409.07226 • Published Sep 11
MMMT-IF: A Challenging Multimodal Multi-Turn Instruction Following Benchmark Paper • 2409.18216 • Published Sep 26
OmniHD-Scenes: A Next-Generation Multimodal Dataset for Autonomous Driving Paper • 2412.10734 • Published 11 days ago
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations Paper • 2412.07626 • Published 15 days ago • 20
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale Paper • 2412.05237 • Published 19 days ago • 45
JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation Paper • 2410.17250 • Published Oct 22 • 14
Long Context Alignment with Short Instructions and Synthesized Positions Paper • 2405.03939 • Published May 7
Evaluating Language Models as Synthetic Data Generators Paper • 2412.03679 • Published 20 days ago • 43
PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos Paper • 2412.01800 • Published 22 days ago • 6
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation Paper • 2412.00927 • Published 24 days ago • 26
OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision Paper • 2411.07199 • Published Nov 11 • 45
When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training Paper • 2411.13476 • Published Nov 20 • 15