Benchmarks for COCO and Flickr?

#4
by maxlun - opened

Been testing this model and it's great; subjectively, it often seems to perform better on English T->I retrieval tasks than the other SigLIP2 variants. Just wondering if you ran it through any retrieval benchmarks besides XM3600?

I didn't run other benchmarks on this model, as I'm primarily interested in multilingual performance. If you look at SigLIP2's numbers, it is not the best for English, because they also focused on multilinguality. Even on XM3600, mexma-siglip2 performs better than siglip2 for English, and I'd expect the trend to be the same on other benchmarks.
[attached image: XM3600 benchmark results]

From what I recall, the SigLIP2 paper stated that 90% of its training data was English, and it does outperform SigLIP and mSigLIP on English as well. I just expected that your variant, which explicitly focuses on multilingual performance, wouldn't also be better at English. Fantastic job anyway. Any plans to train lower-resolution variants? 512 vs. 256 makes quite a big difference in memory consumption and processing speed (for context, I'm using this to encode frames from videos).
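To put rough numbers on that (assuming a ViT-style encoder with 16 px patches, as the `patch16` in the SigLIP2 checkpoint names suggests):

```python
# Rough patch-token arithmetic for a ViT-style image encoder.
# The 16px patch size is an assumption based on the SigLIP2 checkpoint names.
patch = 16
for res in (256, 512):
    n_tokens = (res // patch) ** 2
    print(f"{res}px -> {n_tokens} tokens")
# 256px -> 256 tokens, 512px -> 1024 tokens: 4x the tokens per frame,
# and self-attention cost grows quadratically with token count (~16x).
```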

Right now, I don't have plans to train another version of the model. But when I have time and resources, it'll be something along the lines of NaViT, with native aspect ratio or resolution.

Gotcha. I'll try to run the English retrieval benchmarks just out of curiosity when I have some free time. There's the CLIP_benchmark repo, which should be usable by writing a small wrapper for transformers models (it expects the open_clip function interface, I think). I'll let you know what I find.
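Something roughly like this, untested. The repo id and the `get_image_features` / `get_text_features` method names are assumptions based on the standard transformers SigLIP API; the custom code in this repo may expose different entry points:

```python
import torch
from transformers import AutoModel, AutoProcessor

# Assumed repo id, taken from this thread; adjust as needed.
model_name = "visheratin/mexma-siglip2"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True).eval()
processor = AutoProcessor.from_pretrained(model_name)

class OpenClipStyleWrapper(torch.nn.Module):
    # CLIP_benchmark drives models through the open_clip interface:
    # model.encode_image(pixel_batch) and model.encode_text(token_batch).
    def __init__(self, hf_model):
        super().__init__()
        self.hf_model = hf_model

    @torch.no_grad()
    def encode_image(self, pixel_values):
        return self.hf_model.get_image_features(pixel_values=pixel_values)

    @torch.no_grad()
    def encode_text(self, input_ids):
        # open_clip passes a single token-id tensor; no attention mask here,
        # which may matter for padded batches with an XLM-R text tower.
        return self.hf_model.get_text_features(input_ids=input_ids)

def tokenizer(texts):
    # open_clip-style tokenizer: list of strings -> tensor of token ids.
    return processor(text=texts, padding=True, return_tensors="pt").input_ids

def transform(image):
    # open_clip-style transform: PIL image -> pixel tensor.
    return processor(images=image, return_tensors="pt").pixel_values[0]

wrapper = OpenClipStyleWrapper(model)
```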
