Purpose: Multimodal models that jointly process more than one input type (text, images, audio).
- llava-hf/llava-1.5-7b-hf: Image-Text-to-Text • 7B params • 4.07M downloads • 344 likes
- Salesforce/blip-image-captioning-base: Image-to-Text • 3.39M downloads • 845 likes
- google/pix2struct-base: Image-to-Text • 0.3B params • 3.2k downloads • 76 likes
- microsoft/kosmos-2-patch14-224: Image-to-Text • 176k downloads • 184 likes
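
As a rough illustration of how one of the Image-to-Text entries might be used, the sketch below captions a local image with Salesforce/blip-image-captioning-base through the Hugging Face transformers library. It assumes transformers, torch, and Pillow are installed and that the checkpoint can be downloaded on first use. The function name caption_image, the constant DEFAULT_MODEL_ID, and the max_new_tokens default are illustrative choices, not part of the listing.

```python
# Minimal sketch: caption an image with a BLIP checkpoint from the list above.
# Assumptions (not from the listing): transformers, torch and Pillow are
# installed; `caption_image` and `DEFAULT_MODEL_ID` are hypothetical names.

DEFAULT_MODEL_ID = "Salesforce/blip-image-captioning-base"


def caption_image(image_path, model_id=DEFAULT_MODEL_ID, max_new_tokens=30):
    """Return a generated caption for the image stored at image_path."""
    # Heavy imports are kept inside the function so that importing this
    # module does not require the libraries or trigger any download.
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    # The processor handles image preprocessing and token decoding.
    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForConditionalGeneration.from_pretrained(model_id)

    # BLIP expects a 3-channel RGB image.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    # Generate caption token ids, then decode them to plain text.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

A call such as caption_image("photo.jpg"), where photo.jpg is a placeholder path, would return a short caption string. The other entries in the list use different model classes and prompt formats, so this sketch should not be expected to work for them unchanged.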