Model Card for CLIP_COCO

Model Description

Model Summary

CLIP_COCO is a model presented in the BiVLC paper for experimentation. It was fine-tuned with the OpenCLIP framework, starting from the CLIP ViT-B-32 model pre-trained by 'openai'. The purpose of this fine-tuning is to provide a baseline against which the CLIP_TROHN-Text and CLIP_TROHN-Img models can be compared. Hyperparameters:

  • Learning rate: 1e-6.
  • Scheduler: Cosine scheduler with 50 warmup steps.
  • Optimizer: AdamW optimizer with beta1 = 0.9, beta2 = 0.98, eps = 1e-6 and weight decay = 0.1.
  • Loss function: InfoNCE Loss.
  • Batch size: 400, so each training step contrasts 400 images with 400 captions.
  • Epochs: All models are fine-tuned for 10 epochs, with validation accuracy as the model selection criterion, i.e. we select the checkpoint with the highest accuracy on the corresponding validation set.
  • Data: The model is fine-tuned on the COCO 2017 train split.
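
The loss and batch-size entries above can be illustrated with a minimal, dependency-free sketch of InfoNCE over an N x N image-caption similarity matrix (N = 400 in this setup). It assumes the symmetric form (image-to-text plus text-to-image) commonly used for CLIP-style training; the function names and the fixed `temperature` value are illustrative assumptions, not taken from the BiVLC training code.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE over an N x N similarity matrix.

    sim[i][j] is the similarity between image i and caption j; matched
    pairs sit on the diagonal. `temperature` is an assumed fixed value
    (CLIP-style training usually learns this scale instead).
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    # Image-to-text: each row is a classification over the N captions.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image: each column is a classification over the N images.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return (i2t + t2i) / 2
```

With a batch of 400 pairs, `sim` would be a 400 x 400 matrix, so every image is scored against its own caption plus 399 in-batch negatives, and likewise for every caption.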

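As a rough illustration of the scheduler entry, the following sketch computes a learning rate with 50 warmup steps followed by cosine decay from the base rate of 1e-6. The linear shape of the warmup, decay down to zero, and the `total_steps` argument are assumptions made for illustration; the card only states "cosine scheduler with 50 warmup steps".

```python
import math


def cosine_lr(step, total_steps, base_lr=1e-6, warmup_steps=50):
    """Learning rate at a 0-indexed `step`: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr towards 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

Here `total_steps` would be the number of optimizer steps across the 10 epochs, which depends on the training set size and the batch size of 400.
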
Evaluation Data

The model is evaluated on BiVLC.

Licensing Information

This work is licensed under the MIT License.

Citation Information

If you find this model useful, please consider citing our paper:

@misc{miranda2024bivlc,
      title={BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval}, 
      author={Imanol Miranda and Ander Salaberria and Eneko Agirre and Gorka Azkune},
      year={2024},
      eprint={2406.09952},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}