---
license: mit
datasets:
- imirandam/TROHN-Img
---

# Model Card for CLIP_TROHN-Img_Detector

## Model Description

- **Homepage:** https://imirandam.github.io/BiVLC_project_page/
- **Repository:** https://github.com/IMirandaM/BiVLC
- **Paper:** https://arxiv.org/abs/2406.09952
- **Point of Contact:** [Imanol Miranda](mailto:imanol.miranda@ehu.eus)

### Model Summary

CLIP_TROHN-Img_Detector is a model presented in the [BiVLC](https://github.com/IMirandaM/BiVLC) paper for experimentation. It was trained with the OpenCLIP framework using [CLIP_TROHN-Img](https://huggingface.co/imirandam/CLIP_TROHN-Img) as its base. For binary classification, the encoders are kept frozen; a sigmoid neuron is added over the CLS embedding of the image encoder and over the EOT embedding of the text encoder (more details in the paper). The objective of the model is to classify text and images as natural or synthetic.

Hyperparameters:
* Learning rate: 1e-6.
* Optimizer: Adam with beta1 = 0.9, beta2 = 0.999, eps = 1e-08, and no weight decay.
* Loss function: Binary cross-entropy loss (BCELoss).
* Batch size: 400.
* Epochs: The text detector was trained for 10 epochs and the image detector for 1 epoch. We used validation accuracy as the model selection criterion, i.e. we selected the model with the highest accuracy on the corresponding validation set.
* Data: The sigmoid neuron is trained on the [TROHN-Img](https://huggingface.co/datasets/imirandam/TROHN-Img) dataset.

### Licensing Information

This work is licensed under an MIT License.

## Citation Information

If you find this model useful, please consider citing our paper:

```
@misc{miranda2024bivlc,
      title={BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval},
      author={Imanol Miranda and Ander Salaberria and Eneko Agirre and Gorka Azkune},
      year={2024},
      eprint={2406.09952},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```
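The detector head described above (a single sigmoid neuron trained with BCELoss and Adam over frozen CLIP embeddings) can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `SyntheticDetector`, the embedding dimension of 768, and the label convention (1 = natural, 0 = synthetic) are hypothetical; in practice the input embeddings would come from the frozen CLIP_TROHN-Img image encoder (CLS token) or text encoder (EOT token).

```python
import torch
import torch.nn as nn

class SyntheticDetector(nn.Module):
    """One sigmoid neuron over a frozen encoder embedding (hypothetical sketch)."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        # Single linear unit: embed_dim -> 1 logit
        self.head = nn.Linear(embed_dim, 1)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # Sigmoid turns the logit into a probability in [0, 1]
        return torch.sigmoid(self.head(emb)).squeeze(-1)

detector = SyntheticDetector(embed_dim=768)
criterion = nn.BCELoss()
# Hyperparameters from the model card: lr=1e-6, Adam betas (0.9, 0.999),
# eps=1e-08, no weight decay
optimizer = torch.optim.Adam(
    detector.parameters(), lr=1e-6, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0
)

# Stand-in batch: 400 random vectors in place of frozen CLIP embeddings
emb = torch.randn(400, 768)
labels = torch.randint(0, 2, (400,)).float()  # 1 = natural, 0 = synthetic (assumed)

# One training step
probs = detector(emb)
loss = criterion(probs, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the encoders stay frozen, only the `head` parameters receive gradients; model selection would then pick the checkpoint with the highest validation accuracy, as described above.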