Regarding model fine-tuning
The forward method will probably be more convenient: https://huggingface.co/jinaai/jina-clip-implementation/blob/main/modeling_clip.py#L637. You can of course fine-tune the model; I suggest you take a look at our technical report to understand our training recipe: https://arxiv.org/abs/2412.08802
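For context, a minimal loading sketch (not from this thread): the `encode_text`/`encode_image` helpers follow the model card, while for fine-tuning the `forward` method linked above is the entry point to check, since its exact arguments are defined in `modeling_clip.py`.

```python
# Minimal sketch: load jina-clip-v2 via transformers (custom code repo).
# encode_text / encode_image are the model-card convenience helpers; for
# training, call the model's forward (see the link above) and verify the
# argument names against the implementation file.
from transformers import AutoModel

model = AutoModel.from_pretrained('jinaai/jina-clip-v2', trust_remote_code=True)

text_embeddings = model.encode_text(['a photo of a cat'])
image_embeddings = model.encode_image(['cat.jpg'])  # file paths, URLs, or PIL images
```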
Basically, if you care about text retrieval performance, you need to maintain it by adding text pairs (or even text triplets) alongside the image-caption pairs. Otherwise, simple CLIP-style fine-tuning is enough.
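A hedged sketch of that idea: keep a text-text contrastive term next to the usual symmetric image-caption loss. The function names and the 0.5 weight are illustrative assumptions, not the recipe from the technical report.

```python
# Sketch: joint objective = image-caption InfoNCE + text-text InfoNCE.
# Embeddings are assumed to be precomputed by the two model towers.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.05):
    # a[i] should match b[i] against all other b[j] (in-batch negatives)
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)

def joint_loss(img_emb, cap_emb, query_emb, pos_emb, text_weight=0.5):
    # symmetric CLIP loss over image-caption pairs ...
    it = (info_nce(img_emb, cap_emb) + info_nce(cap_emb, img_emb)) / 2
    # ... plus a text-pair term (query, positive) to preserve text retrieval
    tt = info_nce(query_emb, pos_emb)
    return it + text_weight * tt
```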
@gmastrapas thanks for your advice. Is your training code public?
No, I'm afraid the training code is not public
@gmastrapas Thanks!
@gmastrapas @mylsz @AaronJim
Hi everyone, I have written a project to fine-tune jina-clip-v2, and it is training now. However, I have only one 4090 GPU, so the maximum batch size I can set is 5. I am unsure whether such a small batch size will negatively affect the contrastive loss in CLIP and hurt the training results. If it does, are there any good methods to achieve a larger effective batch size on a single GPU?
This is my training code: https://github.com/tengshaofeng/finetune-jina-clip-v2/blob/main/train_clip.py
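Small batches do weaken the CLIP loss, since the pool of in-batch negatives shrinks with the batch, and plain gradient accumulation does not fix this because each micro-batch still only sees its own negatives. Below is a minimal sketch of one common workaround, a GradCache-style two-pass step (Gao et al., 2021, https://arxiv.org/abs/2101.06983); the `nn.Linear` encoders are toy stand-ins for the real jina-clip towers, and none of this code comes from the linked repo.

```python
# GradCache sketch: decouple the contrastive batch size from per-forward
# memory. Pass 1 computes all embeddings without gradients; the loss is then
# taken over the FULL batch and its gradient w.r.t. each embedding is cached;
# pass 2 re-forwards small chunks with gradients and backprops the cache.
import torch
import torch.nn.functional as F
from torch import nn

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # symmetric InfoNCE; temperature fixed here for illustration (CLIP learns it)
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

def gradcache_step(img_encoder, txt_encoder, images, texts, optimizer, chunk=8):
    img_parts, txt_parts = images.split(chunk), texts.split(chunk)
    # Pass 1: no-grad forwards, so only `chunk` activations live at a time.
    with torch.no_grad():
        img_emb = torch.cat([img_encoder(p) for p in img_parts])
        txt_emb = torch.cat([txt_encoder(p) for p in txt_parts])
    # Loss over the full batch, on leaf copies, to get d(loss)/d(embedding).
    img_emb.requires_grad_(True)
    txt_emb.requires_grad_(True)
    loss = clip_loss(img_emb, txt_emb)
    loss.backward()
    img_grads, txt_grads = img_emb.grad.split(chunk), txt_emb.grad.split(chunk)
    # Pass 2: re-forward each chunk with grad and inject the cached gradients.
    # (For real models, keep dropout/batch-norm behavior identical across passes.)
    optimizer.zero_grad()
    for p, g in zip(img_parts, img_grads):
        img_encoder(p).backward(gradient=g)
    for p, g in zip(txt_parts, txt_grads):
        txt_encoder(p).backward(gradient=g)
    optimizer.step()
    return loss.detach()

if __name__ == '__main__':
    img_enc, txt_enc = nn.Linear(512, 256), nn.Linear(768, 256)  # toy towers
    opt = torch.optim.AdamW([*img_enc.parameters(), *txt_enc.parameters()], lr=1e-5)
    imgs, txts = torch.randn(64, 512), torch.randn(64, 768)  # fake features
    print(gradcache_step(img_enc, txt_enc, imgs, txts, opt).item())
```

If this is more machinery than needed, mixed precision and gradient checkpointing are simpler first steps that also trade compute for memory and usually allow a noticeably larger batch.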