Model Card for CLIP_Detector

Model Description

Model Summary

CLIP_Detector is a model presented in the BiVLC paper for experimentation. It was trained with the OpenCLIP framework, starting from the CLIP ViT-B-32 model with the 'openai' pre-trained weights. For binary classification, the encoders are kept frozen and a sigmoid neuron is added over the CLS embedding of the image encoder and over the EOT embedding of the text encoder (more details in the paper). The objective of the model is to classify text and images as natural or synthetic.

Hyperparameters:

  • Learning rate: 1e-6.
  • Optimizer: Adam with beta1 = 0.9, beta2 = 0.999, eps = 1e-08 and no weight decay.
  • Loss function: Binary cross-entropy loss (BCELoss).
  • Batch size: 400.
  • Epochs: We trained the text detector for 10 epochs and the image detector for 1 epoch. We used validation accuracy as the model selection criterion, i.e. we selected the model with the highest accuracy on the corresponding validation set.
  • Data: The sigmoid neuron is trained with the TROHN-Img dataset.
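
The setup described above (a single sigmoid neuron on top of a frozen encoder embedding, trained with Adam and BCELoss, with the best epoch chosen by validation accuracy) can be sketched in PyTorch as follows. This is only an illustrative sketch, not the authors' training code: the names SigmoidHead, train_head and accuracy are hypothetical, the embedding size of 512 is an assumption, the 0.5 decision threshold is an assumption, and random tensors stand in for the frozen CLIP embeddings. This card does not state which class is labelled 1, so the labels here are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)

EMBED_DIM = 512  # assumption: size of the embedding fed to the head


class SigmoidHead(nn.Module):
    """Hypothetical head: one sigmoid neuron over a frozen encoder embedding."""

    def __init__(self, embed_dim: int = EMBED_DIM):
        super().__init__()
        self.linear = nn.Linear(embed_dim, 1)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # One probability per sample, shape (batch,).
        return torch.sigmoid(self.linear(embeddings)).squeeze(-1)


def accuracy(head, embeddings, labels):
    # Assumption: a 0.5 threshold turns probabilities into class predictions.
    with torch.no_grad():
        preds = (head(embeddings) >= 0.5).float()
    return (preds == labels).float().mean().item()


def train_head(train_x, train_y, val_x, val_y, epochs, batch_size=400):
    head = SigmoidHead(train_x.shape[1])
    # Hyperparameters as listed in the card; only the head is optimized,
    # which mirrors keeping the encoders frozen.
    optimizer = torch.optim.Adam(
        head.parameters(), lr=1e-6, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0
    )
    loss_fn = nn.BCELoss()
    best_acc, best_state = -1.0, None
    for _ in range(epochs):
        perm = torch.randperm(train_x.shape[0])
        for start in range(0, len(perm), batch_size):
            idx = perm[start:start + batch_size]
            optimizer.zero_grad()
            loss = loss_fn(head(train_x[idx]), train_y[idx])
            loss.backward()
            optimizer.step()
        # Model selection: keep the weights with the highest validation accuracy.
        val_acc = accuracy(head, val_x, val_y)
        if val_acc > best_acc:
            best_acc = val_acc
            best_state = {k: v.clone() for k, v in head.state_dict().items()}
    head.load_state_dict(best_state)
    return head, best_acc


# Stand-in data: random tensors replace precomputed frozen CLIP embeddings.
train_x = torch.randn(800, EMBED_DIM)
train_y = torch.randint(0, 2, (800,)).float()
val_x = torch.randn(200, EMBED_DIM)
val_y = torch.randint(0, 2, (200,)).float()

head, best_val_acc = train_head(train_x, train_y, val_x, val_y, epochs=2)
probs = head(val_x[:4])  # probabilities for four validation samples
```

With real data, train_x and val_x would hold the CLS (image) or EOT (text) embeddings produced by the frozen encoders, and epochs would be 10 for the text detector and 1 for the image detector, as listed above.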

Licensing Information

This work is licensed under an MIT License.

Citation Information

If you find this model useful, please consider citing our paper:

@misc{miranda2024bivlc,
      title={BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval}, 
      author={Imanol Miranda and Ander Salaberria and Eneko Agirre and Gorka Azkune},
      year={2024},
      eprint={2406.09952},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}