Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head Paper • 2403.06892 • Published Mar 11, 2024 • 1
GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection Paper • 2312.15043 • Published Dec 22, 2023 • 1
VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations Paper • 2207.00221 • Published Jul 1, 2022 • 1
OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network Paper • 2209.05946 • Published Sep 10, 2022 • 1
OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding Paper • 2407.04923 • Published Jul 6, 2024 • 1
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling Paper • 2412.05271 • Published Dec 2024 • 121
ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration Paper • 2411.16044 • Published Nov 2024 • 1
Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages Paper • 2410.16153 • Published Oct 21, 2024 • 43