Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia Paper • 2503.07920 • Published Mar 10 • 97
SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills Paper • 2504.07079 • Published 9 days ago • 11
VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge Paper • 2504.10342 • Published 4 days ago • 9
Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators Paper • 2503.19877 • Published 24 days ago
ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations Paper • 2504.00824 • Published 17 days ago • 38
Mind the Gap! Static and Interactive Evaluations of Large Audio Models Paper • 2502.15919 • Published Feb 21 • 4
EgoNormia: Benchmarking Physical Social Norm Understanding Paper • 2502.20490 • Published Feb 27 • 5
Grounded Persuasive Language Generation for Automated Marketing Paper • 2502.16810 • Published Feb 24 • 12
HAICOSYSTEM: An Ecosystem for Sandboxing Safety Risks in Human-AI Interactions Paper • 2409.16427 • Published Sep 24, 2024 • 1
What Are Tools Anyway? A Survey from the Language Model Perspective Paper • 2403.15452 • Published Mar 18, 2024
Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate Paper • 2501.17703 • Published Jan 29 • 58
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos Paper • 2501.13826 • Published Jan 23 • 26
CodeRAG-Bench: Can Retrieval Augment Code Generation? Paper • 2406.14497 • Published Jun 20, 2024 • 2