LLaVE is a series of large language and vision embedding models trained on a variety of multimodal embedding datasets.
- zhibinlan/LLaVE-0.5B (Image-Text-to-Text)
- zhibinlan/LLaVE-2B (Image-Text-to-Text)
- zhibinlan/LLaVE-7B (Image-Text-to-Text)
- Paper: LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning (arXiv:2503.04812)
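The paper's title refers to hardness-weighted contrastive learning. As a rough illustration of the general idea, the sketch below shows an InfoNCE-style loss in which negative pairs are re-weighted by their similarity to the query (their "hardness"). This is a generic, hypothetical formulation for intuition only, not the exact loss from the LLaVE paper; the function name, `alpha` parameter, and weighting scheme are assumptions.

```python
import numpy as np

def hardness_weighted_infonce(query, positive, negatives, tau=0.07, alpha=1.0):
    """Illustrative InfoNCE-style loss with hardness-weighted negatives.

    query, positive: (d,) unit-normalized embeddings of a matched pair.
    negatives: (n, d) unit-normalized embeddings of non-matching candidates.
    alpha: hardness temperature; alpha=0 recovers plain InfoNCE.

    NOTE: a hypothetical sketch, not the LLaVE paper's exact loss.
    """
    pos_sim = query @ positive / tau          # scalar similarity to the positive
    neg_sims = negatives @ query / tau        # (n,) similarities to negatives
    # Up-weight harder (more similar) negatives, normalizing weights to mean 1
    # so that alpha=0 yields uniform weights and the standard InfoNCE loss.
    w = np.exp(alpha * neg_sims)
    w = w * len(neg_sims) / w.sum()
    denom = np.exp(pos_sim) + (w * np.exp(neg_sims)).sum()
    return -(pos_sim - np.log(denom))
```

With `alpha=0` the weights collapse to 1 and the function reduces to the standard InfoNCE objective; larger `alpha` shifts the gradient budget toward the hardest negatives in the batch.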