Do you have a 224x224 or 384x384 pretrained CLIP model?

#42
by bojohn - opened

Thanks for your great work. It is useful. But I found the image encoder is a little slow at inference time, so do you have a 224x224 or 384x384 pretrained CLIP model?

+1. I can't run inference on images in a reasonable time because of their size. The model can't fit into RAM, so I can't use all the available GPUs.

Jina AI org

Thanks for your feedback! We do not offer a lower-resolution version of this model. However, if you don't need multilinguality, you can check https://huggingface.co/jinaai/jina-clip-v1, a smaller model with a lower input resolution and similar performance on English text and cross-modal tasks.
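For reference, here is a minimal usage sketch for that smaller model, assuming the `trust_remote_code` loading path and the `encode_text`/`encode_image` helpers shown on the jina-clip-v1 model card (`cat.jpg` is a hypothetical local file):

```python
from transformers import AutoModel

# jina-clip-v1: English-only, smaller backbone, 224x224 input resolution.
model = AutoModel.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)

# The custom model code exposes convenience encoders for both modalities.
text_embeddings = model.encode_text(['a photo of a cat'])
image_embeddings = model.encode_image(['cat.jpg'])  # placeholder image path
```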

That said, you can improve inference speed by using bf16, xFormers, and FlashAttention. You can also try a higher patch dropout rate to drop more image patches before processing. If the model is still slow, I suggest trying the ONNX model and the quantized versions.
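Here is a minimal sketch of the bf16 suggestion. Assumptions: the model loads through the standard `AutoModel.from_pretrained` path with `trust_remote_code=True`, accepts the usual `torch_dtype` argument, and exposes `encode_image` as described on the model card; `photo.jpg` is a hypothetical local file.

```python
import torch
from transformers import AutoModel

# Load the model in bfloat16 to roughly halve memory use and speed up
# inference on GPUs with native bf16 support (Ampere or newer).
model = AutoModel.from_pretrained(
    'jinaai/jina-clip-v2',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to('cuda').eval()

with torch.inference_mode():
    # encode_image accepts image URLs or PIL images per the model card;
    # 'photo.jpg' is a placeholder input.
    embeddings = model.encode_image(['photo.jpg'])
```

For the ONNX route, you can load the exported weights with onnxruntime and inspect the expected graph inputs before wiring up preprocessing. The exact file name and input names depend on the repo's ONNX export, so treat this as a starting point only:

```python
import onnxruntime as ort

# Hypothetical path; check the repo's onnx/ folder for the actual file names.
session = ort.InferenceSession('onnx/model.onnx', providers=['CPUExecutionProvider'])
print([(i.name, i.shape) for i in session.get_inputs()])  # inspect expected inputs
```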
