SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement • Paper • 2504.07934 • Published 11 days ago • 16
DisCo: Disentangled Control for Referring Human Dance Generation in Real World • Paper • 2307.00040 • Published Jun 30, 2023 • 25
Equivariant Similarity for Vision-Language Foundation Models • Paper • 2303.14465 • Published Mar 25, 2023
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling • Paper • 2206.07160 • Published Jun 14, 2022
SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning • Paper • 2111.13196 • Published Nov 25, 2021
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) • Paper • 2309.17421 • Published Sep 29, 2023 • 3
Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation • Paper • 2310.08541 • Published Oct 12, 2023 • 18
MM-VID: Advancing Video Understanding with GPT-4V(ision) • Paper • 2310.19773 • Published Oct 30, 2023 • 20