TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies Paper • 2412.10345 • Published Dec 2024
OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation Paper • 2412.09585 • Published Dec 2024
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion Paper • 2412.04424 • Published Dec 5, 2024
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models Paper • 2410.10818 • Published Oct 14, 2024
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs Paper • 2404.16375 • Published Apr 25, 2024
LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models Paper • 2312.02949 • Published Dec 5, 2023
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation Paper • 2311.07562 • Published Nov 13, 2023
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents Paper • 2311.05437 • Published Nov 9, 2023
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing Paper • 2311.00571 • Published Nov 1, 2023
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V Paper • 2310.11441 • Published Oct 17, 2023
An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models Paper • 2309.09958 • Published Sep 18, 2023
Multimodal Foundation Models: From Specialists to General-Purpose Assistants Paper • 2309.10020 • Published Sep 18, 2023
Semantic-SAM: Segment and Recognize Anything at Any Granularity Paper • 2307.04767 • Published Jul 10, 2023