I love vision language models 💗
My favorite is KOSMOS-2, because it's a grounded model: it links phrases in its output to regions of the image, which can help cut down on hallucination.
In this demo you can:
- ask a question about the image,
- do detailed/brief captioning,
- localize the objects! 🤯
It's just amazing for a VLM to return bounding boxes 🤩
Try it here: merve/kosmos2
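
If you'd rather poke at the bounding boxes in code, here's a minimal sketch, not how the demo itself is built. It assumes the `microsoft/kosmos-2-patch14-224` checkpoint in 🤗 Transformers and the `<grounding>` prompt prefix. The helper names `to_pixel_boxes` and `ground_caption` are made up for illustration. It also assumes entities come back as `(phrase, (start, end), [boxes])`, with box coordinates normalized to the 0..1 range.

```python
from typing import List, Tuple

# Assumed entity shape: (phrase, (start, end), [(x1, y1, x2, y2), ...]),
# with box coordinates normalized to the 0..1 range.
Box = Tuple[float, float, float, float]
Entity = Tuple[str, Tuple[int, int], List[Box]]


def to_pixel_boxes(entities: List[Entity], width: int, height: int):
    """Hypothetical helper: scale normalized boxes to pixel coordinates."""
    results = []
    for phrase, _span, boxes in entities:
        for x1, y1, x2, y2 in boxes:
            results.append((
                phrase,
                (round(x1 * width), round(y1 * height),
                 round(x2 * width), round(y2 * height)),
            ))
    return results


def ground_caption(image, prompt: str = "<grounding>An image of"):
    """Hypothetical helper: caption a PIL image and return (caption, pixel boxes).

    Imports are kept inside the function so the box helper above can be used
    without transformers installed. Downloads the checkpoint on first call.
    """
    from transformers import AutoModelForVision2Seq, AutoProcessor

    checkpoint = "microsoft/kosmos-2-patch14-224"  # assumed checkpoint
    processor = AutoProcessor.from_pretrained(checkpoint)
    model = AutoModelForVision2Seq.from_pretrained(checkpoint)

    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        pixel_values=inputs["pixel_values"],
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image_embeds=None,
        image_embeds_position_mask=inputs["image_embeds_position_mask"],
        max_new_tokens=64,
    )
    generated_text = processor.batch_decode(
        generated_ids, skip_special_tokens=True
    )[0]
    # Splits the raw output into a clean caption plus grounded entities.
    caption, entities = processor.post_process_generation(generated_text)
    return caption, to_pixel_boxes(entities, image.width, image.height)
```

To try it, open any image with `PIL.Image.open(...)` and call `ground_caption(image)`. Each phrase in the result comes with a box you can draw straight onto the image.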