Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces Paper • 2412.14171 • Published Dec 18, 2024 • 22
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory Paper • 2411.11922 • Published Nov 18, 2024 • 18
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark Paper • 2410.03051 • Published Oct 4, 2024 • 4
Chasing Consistency in Text-to-3D Generation from a Single Image Paper • 2309.03599 • Published Sep 7, 2023 • 1
RT-Pose: A 4D Radar Tensor-based 3D Human Pose Estimation and Localization Benchmark Paper • 2407.13930 • Published Jul 18, 2024
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs Paper • 2406.16860 • Published Jun 24, 2024 • 59
Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense? Paper • 2406.07546 • Published Jun 11, 2024 • 8
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding Paper • 2406.09411 • Published Jun 13, 2024 • 18
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models Paper • 2406.09403 • Published Jun 13, 2024 • 19
ImagenHub: Standardizing the evaluation of conditional image generation models Paper • 2310.01596 • Published Oct 2, 2023 • 18
BLINK: Multimodal Large Language Models Can See but Not Perceive Paper • 2404.12390 • Published Apr 18, 2024 • 24
Deconstructing Denoising Diffusion Models for Self-Supervised Learning Paper • 2401.14404 • Published Jan 25, 2024 • 17
SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers Paper • 2401.08740 • Published Jan 16, 2024 • 12
StableVideo: Text-driven Consistency-aware Diffusion Video Editing Paper • 2308.09592 • Published Aug 18, 2023 • 2
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding Paper • 2307.16449 • Published Jul 31, 2023 • 15