---
license: mit
language:
- pt
tags:
- CAPIVARA
- Portuguese CLIP
- OpenCLIP
datasets:
- conceptual_captions
- PraCegoVer
- MS_COCO
- Flickr30K
- ImageNet
- ELEVATER
---

# Model Card for CAPIVARA

CAPIVARA is a cost-efficient framework designed to enhance the performance of multilingual CLIP models in low-resource languages. The model achieves state-of-the-art results on several zero-shot tasks involving images and Portuguese text.

## How to use

```python
import open_clip

model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:hiaac-nlp/CAPIVARA')
tokenizer = open_clip.get_tokenizer('hf-hub:hiaac-nlp/CAPIVARA')
```

For more details, refer to the [GitHub repo](https://github.com/hiaac-nlp/CAPIVARA/).

## Model Details

### Model Description

CAPIVARA is built upon the pre-trained [OpenCLIP ViT-B/32 XLM-Roberta Base](https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k) and fine-tuned with [Conceptual Captions](https://aclanthology.org/P18-1238.pdf) and synthetic captions generated by [BLIP2](https://huggingface.co/Salesforce/blip2-opt-2.7b-coco). All captions are translated into Portuguese with Google Translate.

## Uses

### Direct Use

Zero-shot image classification, zero-shot image and text retrieval, etc.

### Downstream Use

Image classification and other image-task fine-tuning, linear-probe image classification, image captioning, image-generation guiding and conditioning, etc.

## Ethical considerations

For ethical considerations, please refer to the Model Cards section of the [paper](https://arxiv.org/abs/2310.13683).

## Training Details

### Training Data

The model was fine-tuned with [Conceptual Captions](https://aclanthology.org/P18-1238.pdf) and synthetic captions generated by [BLIP2](https://huggingface.co/Salesforce/blip2-opt-2.7b-coco). All captions are translated into Portuguese with Google Translate.
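At inference time, CLIP-style zero-shot scores are softmaxed cosine similarities between the image embedding and the candidate text embeddings produced by the model loaded in "How to use" (via `model.encode_image` / `model.encode_text`). A minimal, framework-free sketch of that scoring step — the toy vectors below stand in for real embeddings and are not from the CAPIVARA repo:

```python
import math

def zero_shot_probs(image_feature, text_features, temperature=100.0):
    """Softmax over cosine similarities between one image embedding and
    candidate text embeddings -- the scoring step of CLIP-style zero-shot
    classification. Plain-Python sketch; real features would come from
    model.encode_image / model.encode_text."""
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / n for x in v]

    img = normalize(image_feature)
    logits = [temperature * sum(a * b for a, b in zip(img, normalize(t)))
              for t in text_features]
    peak = max(logits)  # subtract the max logit for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: the image embedding is closest to the first "label" embedding.
probs = zero_shot_probs([1.0, 0.0], [[0.9, 0.1], [0.0, 1.0]])
```

The temperature of 100 mirrors the logit scale commonly used by CLIP models; with real embeddings, the highest-probability text prompt is taken as the predicted class.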
#### Training Hyperparameters

```
Optimizer: "Adam"
eps: 1e-8
weight_decay: 0.2
betas: [ 0.9, 0.98 ]
LR_scheduler: "CosineWarmupLR"
min_learning_rate: 1e-7
max_learning_rate: 5e-7
warmup_lr: 500
batch_size: 2816
max_steps: 5863  # 10 epochs
```

## Evaluation

+ [Zero-shot image classification](https://github.com/hiaac-nlp/CAPIVARA/blob/main/clip_pt/src/evaluate/capivara_classification.ipynb)
+ [Zero-shot cross-modal retrieval](https://github.com/hiaac-nlp/CAPIVARA/blob/main/clip_pt/src/evaluate/capivara_retrieval.ipynb)

### Testing Data, Factors & Metrics

#### Testing Data

For cross-modal retrieval, we used [PraCegoVer](https://www.mdpi.com/2306-5729/7/2/13), which is composed of images originally annotated with Portuguese text, and our Portuguese-translated versions of [MS COCO](https://link.springer.com/chapter/10.1007/978-3-319-10602-1_48) and [Flickr30k](https://openaccess.thecvf.com/content_iccv_2015/papers/Plummer_Flickr30k_Entities_Collecting_ICCV_2015_paper.pdf). We also translated the labels of [ImageNet](https://ieeexplore.ieee.org/document/5206848) and of the [ELEVATER](https://proceedings.neurips.cc/paper_files/paper/2022/hash/3c4688b6a76f25f2311daa0d75a58f1a-Abstract-Datasets_and_Benchmarks.html) benchmark datasets for image classification.

### Results

#### Zero-shot Cross-Modal Retrieval

We conducted zero-shot cross-modal retrieval experiments on Flickr30k and MS COCO with captions translated into Portuguese, as well as on PraCegoVer. We report the mean and standard deviation over 3 runs.
| Model | Flickr30k text-to-image | Flickr30k image-to-text | MS COCO text-to-image | MS COCO image-to-text | PraCegoVer text-to-image | PraCegoVer image-to-text |
|---|---|---|---|---|---|---|
| OpenCLIP ViT-B/32 XLM-Roberta Base (baseline) | 76.23 | 87.93 | 52.62 | 66.55 | 65.36 | 69.43 |
| CAPIVARA | 79.56 ± 0.01 | 89.95 ± 0.04 | 56.27 ± 0.01 | 71.24 ± 0.01 | 66.40 ± 0.01 | 64.75 ± 0.01 |
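Cross-modal retrieval is typically scored with recall@k: the fraction of queries whose ground-truth match appears among the k most similar candidates. A generic sketch of that computation over a similarity matrix — this is illustrative only, not the evaluation code from the linked notebooks:

```python
def recall_at_k(similarity, ground_truth, k=1):
    """Fraction of queries whose ground-truth item appears among the top-k
    most similar candidates. `similarity[q][c]` scores query q against
    candidate c; `ground_truth[q]` is the index of q's correct candidate.
    Generic sketch, not the CAPIVARA repo's evaluation code."""
    hits = 0
    for q, row in enumerate(similarity):
        # Rank candidate indices by descending similarity, keep the top k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if ground_truth[q] in top_k:
            hits += 1
    return hits / len(similarity)

# Toy 3x3 similarity matrix: queries 0 and 2 rank their target first, query 1 does not.
sim = [[0.9, 0.2, 0.1],
       [0.3, 0.1, 0.8],
       [0.2, 0.4, 0.7]]
r1 = recall_at_k(sim, ground_truth=[0, 1, 2], k=1)  # 2/3 of targets ranked first
```

For text-to-image retrieval each row would be a caption query scored against all image embeddings, and vice versa for image-to-text; the same function covers both directions.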