- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 28
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 13
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23

Collections including paper arxiv:2504.00557

- QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
  Paper • 2502.05178 • Published • 10
- Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
  Paper • 2502.14846 • Published • 13
- SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
  Paper • 2502.14786 • Published • 141
- Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features
  Paper • 2504.00557 • Published • 15

- Unveiling the Backbone-Optimizer Coupling Bias in Visual Representation Learning
  Paper • 2410.06373 • Published • 34
- MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization
  Paper • 2504.00999 • Published • 78
- What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models
  Paper • 2503.24235 • Published • 51
- MoCha: Towards Movie-Grade Talking Character Synthesis
  Paper • 2503.23307 • Published • 120

- DocLLM: A layout-aware generative language model for multimodal document understanding
  Paper • 2401.00908 • Published • 184
- COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
  Paper • 2401.00849 • Published • 17
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  Paper • 2311.05437 • Published • 50
- LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
  Paper • 2311.00571 • Published • 41