Regarding model fine-tuning

#23
by mylsz - opened

This is amazing work—thank you for your contribution and for making it open source!
Can you provide some suggestions on how to continue fine-tuning the jina-clip-v2 model with local data? After I comment out @torch.inference_mode(), can I directly load and fine-tune this model?

The forward method would probably be more convenient: https://huggingface.co/jinaai/jina-clip-implementation/blob/main/modeling_clip.py#L637. You can of course fine-tune the model; I suggest taking a look at our technical report to understand our training recipe: https://arxiv.org/abs/2412.08802

Basically, if you care about text retrieval performance, you need to maintain it by adding text pairs (or even text triplets) alongside image-caption pairs. Otherwise, simple CLIP-like fine-tuning would be enough.
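
For illustration, here is a minimal sketch of the "simple CLIP-like fine-tuning" objective mentioned above: the standard symmetric InfoNCE loss over paired image/text embeddings. It assumes you can obtain L2-normalized image and text embeddings from the model's forward pass (check modeling_clip.py, linked above, for the exact signature).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch_size, dim), assumed L2-normalized.
    Matching pairs sit on the diagonal of the similarity matrix.
    """
    logits = image_emb @ text_emb.t() / temperature            # (B, B) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)                # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)            # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```

The batch size matters here because every other sample in the batch acts as an in-batch negative, which is also why the single-GPU question further down this thread comes up.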

@gmastrapas thanks for your advice. Do you have the training code public?


No, I am afraid the training code is not public.

@gmastrapas Thanks!

mylsz changed discussion status to closed

@gmastrapas @mylsz @AaronJim Hi everyone, I have written a project to fine-tune jina-clip-v2, and it is training now. However, I have only one 4090 GPU, so the maximum batch size I can set is 5. I am unsure whether such a small batch size for the contrastive loss in CLIP will negatively affect the training results. If it does, are there any good methods to achieve a larger batch size on a single GPU?
This is my training code: https://github.com/tengshaofeng/finetune-jina-clip-v2/blob/main/train_clip.py
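
As a general (not repository-specific) note on the single-GPU question: plain gradient accumulation does not enlarge the set of in-batch negatives, so it is a weak substitute for a large contrastive batch; techniques like GradCache exist for emulating truly large batches. What usually helps directly is reducing memory so a larger real batch fits: mixed precision and gradient checkpointing. Below is a minimal sketch assuming a hypothetical `model` and `dataloader`, and an assumed forward signature that returns normalized image/text embeddings (check modeling_clip.py for the real one).

```python
import torch
import torch.nn.functional as F
from torch.cuda.amp import GradScaler, autocast

# Hypothetical setup: `model` is the loaded jina-clip-v2 model and `dataloader`
# yields (images, texts) batches; adapt both to your own pipeline.
def train_one_epoch(model, dataloader, lr: float = 1e-5, temperature: float = 0.07):
    device = torch.device("cuda")
    model.to(device).train()

    # Gradient checkpointing trades extra compute for a much smaller activation
    # footprint, which usually allows a larger batch size (if the model supports it).
    if hasattr(model, "gradient_checkpointing_enable"):
        model.gradient_checkpointing_enable()

    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scaler = GradScaler()  # keeps fp16 gradients numerically stable

    for images, texts in dataloader:
        optimizer.zero_grad(set_to_none=True)
        with autocast():  # forward pass in mixed precision roughly halves activation memory
            # Assumed forward signature returning L2-normalized embeddings.
            image_emb, text_emb = model(images.to(device), texts)
            logits = image_emb @ text_emb.t() / temperature
            targets = torch.arange(logits.size(0), device=device)
            loss = (F.cross_entropy(logits, targets) +
                    F.cross_entropy(logits.t(), targets)) / 2
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
```

Freezing one tower (e.g. the image encoder) is another common way to cut memory further at the cost of flexibility.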
