Benchmarks for COCO and Flickr?

#4
by maxlun - opened

Been testing this model and it's great; subjectively, it often seems to perform better on English T->I retrieval tasks than the other SigLIP2 variants. Just wondering if you ran it through any retrieval benchmarks besides XM3600?

I didn't run other benchmarks on this model, as I'm primarily interested in multilingual performance. If you look at SigLIP2's numbers, it is not the best for English, because they also focused on multilinguality. Even on XM3600, mexma-siglip2 performs better than siglip2 for English, and I'd expect the trend to be the same on other benchmarks.
[attached image: XM3600 benchmark results]

From what I recall, the SigLIP2 paper stated that 90% of its training data was English, and it does outperform SigLIP and mSigLIP on English as well. I just expected that your variant, which explicitly focuses on multilingual performance, wouldn't also be better at English. Fantastic job anyway. Any plans to train lower-resolution variants? 512 vs. 256 makes quite a big difference in memory consumption and processing speed (for context, I'm using this to encode frames from videos).
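To put rough numbers on that (assuming a ViT-style encoder with 16 px patches, as the `patch16` in the SigLIP2 checkpoint names suggests):

```python
# Rough patch-token arithmetic for a ViT-style image encoder.
# The 16px patch size is an assumption based on the SigLIP2 checkpoint names.
patch = 16
for res in (256, 512):
    n_tokens = (res // patch) ** 2
    print(f"{res}px -> {n_tokens} tokens")
# 256px -> 256 tokens, 512px -> 1024 tokens: 4x the tokens per frame,
# and self-attention cost grows quadratically with token count (~16x).
```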

Right now, I don't have plans to train another version of the model. But when I have time and resources, it'll be something along the lines of NaViT, with native aspect ratio or resolution.

Gotcha. I'll try to run the English retrieval benchmarks just out of curiosity when I have some free time. There's the CLIP_benchmark repo, which should be usable by writing a small wrapper for transformers models (it expects the open_clip function interface, I think). I'll let you know what I find.
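Something roughly like this, untested. The repo id and the `get_image_features` / `get_text_features` method names are assumptions based on the standard transformers SigLIP API; the custom code in this repo may expose different entry points:

```python
import torch
from transformers import AutoModel, AutoProcessor

# Assumed repo id, taken from this thread; adjust as needed.
model_name = "visheratin/mexma-siglip2"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True).eval()
processor = AutoProcessor.from_pretrained(model_name)

class OpenClipStyleWrapper(torch.nn.Module):
    # CLIP_benchmark drives models through the open_clip interface:
    # model.encode_image(pixel_batch) and model.encode_text(token_batch).
    def __init__(self, hf_model):
        super().__init__()
        self.hf_model = hf_model

    @torch.no_grad()
    def encode_image(self, pixel_values):
        return self.hf_model.get_image_features(pixel_values=pixel_values)

    @torch.no_grad()
    def encode_text(self, input_ids):
        # open_clip passes a single token-id tensor; no attention mask here,
        # which may matter for padded batches with an XLM-R text tower.
        return self.hf_model.get_text_features(input_ids=input_ids)

def tokenizer(texts):
    # open_clip-style tokenizer: list of strings -> tensor of token ids.
    return processor(text=texts, padding=True, return_tensors="pt").input_ids

def transform(image):
    # open_clip-style transform: PIL image -> pixel tensor.
    return processor(images=image, return_tensors="pt").pixel_values[0]

wrapper = OpenClipStyleWrapper(model)
```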
