license: mit
language:
  - pt
tags:
  - CAPIVARA
  - Portuguese CLIP
  - OpenCLIP
datasets:
  - conceptual_captions
  - PraCegoVer
  - MS_COCO
  - Flickr30K
  - ImageNet
  - ELEVATER

Model Card for CAPIVARA

CAPIVARA is a cost-efficient framework designed to enhance the performance of multilingual CLIP models in low-resource languages. This model holds the state of the art in many zero-shot tasks involving images and Portuguese texts.

How to use

import open_clip

model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:hiaac-nlp/CAPIVARA')
tokenizer = open_clip.get_tokenizer('hf-hub:hiaac-nlp/CAPIVARA')

For more details, refer to the GitHub repository.
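Building on the loading snippet above, the following is a minimal sketch of zero-shot image classification with Portuguese labels. It follows the usual OpenCLIP pattern (`encode_image`, `encode_text`, cosine similarity, softmax). The prompt template `"uma foto de {}"`, the fixed similarity scale of 100, and the helper names `build_prompts` and `zero_shot_classify` are illustrative assumptions, not part of the CAPIVARA release.

```python
def build_prompts(labels, template="uma foto de {}"):
    # Hypothetical Portuguese prompt template ("a photo of {}"); not taken from the model card.
    return [template.format(label) for label in labels]


def zero_shot_classify(image_path, labels):
    # Heavy dependencies are imported lazily, so build_prompts works without them.
    import torch
    import open_clip
    from PIL import Image

    # create_model_and_transforms returns (model, train transform, validation transform).
    model, _, preprocess = open_clip.create_model_and_transforms("hf-hub:hiaac-nlp/CAPIVARA")
    tokenizer = open_clip.get_tokenizer("hf-hub:hiaac-nlp/CAPIVARA")
    model.eval()

    image = preprocess(Image.open(image_path)).unsqueeze(0)  # add a batch dimension
    text = tokenizer(build_prompts(labels))

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        # L2-normalize so the dot product below is a cosine similarity.
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        # Assumed fixed scale of 100 before the softmax, as in common OpenCLIP examples.
        probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

    # Map each label to its probability for the single input image.
    return dict(zip(labels, probs[0].tolist()))
```

A call such as `zero_shot_classify("capivara.jpg", ["capivara", "gato", "cachorro"])` (the file name is hypothetical) would return one probability per label.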

Model Details

Model Description

CAPIVARA builds on the pre-trained OpenCLIP ViT-B/32 XLM-Roberta Base model and is fine-tuned on Conceptual Captions and synthetic captions generated by BLIP2. All captions are translated with Google Translate.

Uses

Direct Use

Zero-shot image classification, zero-shot image and text retrieval, etc.

Downstream Use

Image classification and other image task fine-tuning, linear probe image classification, image captioning, image generation guiding and conditioning, etc.

Ethical considerations

For ethical considerations, please refer to the Model Cards section of the paper.

Training Details

Training Data

The model was fine-tuned on Conceptual Captions and synthetic captions generated by BLIP2. All captions were translated with Google Translate.

Training Hyperparameters

Optimizer: "Adam"
eps: 1e-8
weight_decay: 0.2
betas: [ 0.9, 0.98 ]

LR_scheduler: "CosineWarmupLR"
min_learning_rate: 1e-7
max_learning_rate: 5e-7
warmup_lr: 500

batch_size: 2816
max_steps: 5863 # 10 epochs
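
To make the schedule settings above concrete, here is a small sketch of a cosine learning-rate schedule with warmup. The exact behavior of `CosineWarmupLR` is not described in this card, so two things are assumptions: `warmup_lr: 500` is read as 500 warmup steps, and the warmup is taken to rise linearly from `min_learning_rate` to `max_learning_rate` before a cosine decay back to `min_learning_rate`. The function name `cosine_warmup_lr` is illustrative.

```python
import math


def cosine_warmup_lr(step, max_steps=5863, warmup_steps=500,
                     min_lr=1e-7, max_lr=5e-7):
    # Assumed linear warmup from min_lr to max_lr over the first warmup_steps.
    if step < warmup_steps:
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    progress = min(progress, 1.0)  # clamp in case step exceeds max_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Print the learning rate at the start, end of warmup, midpoint, and final step.
for s in (0, 500, 3181, 5863):
    print(s, cosine_warmup_lr(s))
```

Under these assumptions the rate peaks at 5e-7 at step 500 and returns to 1e-7 at step 5863.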

Evaluation

Testing Data, Factors & Metrics

Testing Data

For cross-modal retrieval, we used PraCegoVer, which is composed of images originally annotated with Portuguese texts, and our Portuguese-translated versions of MS COCO and Flickr30k. We also translated the labels from ImageNet and the ELEVATER benchmark datasets for image classification.

Results

Zero-shot Cross-Modal Retrieval

We conducted zero-shot cross-modal retrieval experiments on Flickr30k and MS COCO with captions translated into Portuguese, and PraCegoVer. We report the average and standard deviation for 3 runs.

| Models | Flickr30k (text-to-image) | Flickr30k (image-to-text) | MS COCO (text-to-image) | MS COCO (image-to-text) | PraCegoVer (text-to-image) | PraCegoVer (image-to-text) |
|---|---|---|---|---|---|---|
| OpenCLIP ViT-B/32 XLM-Roberta Base (Baseline) | 76.23 | 87.93 | 52.62 | 66.55 | 65.36 | 69.43 |
| CAPIVARA | 79.56 ± 0.01 | 89.95 ± 0.04 | 56.27 ± 0.01 | 71.24 ± 0.01 | 66.40 ± 0.01 | 64.75 ± 0.01 |
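
The card does not name the retrieval metric behind these numbers. As an illustration only, the sketch below assumes a recall@k-style score computed from a query-by-candidate similarity matrix, and summarizes repeated runs with a mean and a sample standard deviation (whether the reported deviation is sample or population is also not stated). The names `recall_at_k` and `mean_and_std` and the toy data are hypothetical.

```python
import statistics


def recall_at_k(similarity, ground_truth, k):
    # similarity[q][c]: score between query q and candidate c.
    # ground_truth[q]: set of candidate indices that are correct for query q.
    hits = 0
    for q, scores in enumerate(similarity):
        # Rank candidates from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if any(c in ground_truth[q] for c in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(similarity)


def mean_and_std(values):
    # Sample standard deviation (n - 1 denominator); an assumption, see the note above.
    return statistics.mean(values), statistics.stdev(values)


# Toy example: 3 text queries against 3 images, one correct image per query.
toy_similarity = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.4, 0.7],
]
toy_ground_truth = [{0}, {1}, {2}]
print(recall_at_k(toy_similarity, toy_ground_truth, k=1))
print(mean_and_std([1.0, 2.0, 3.0]))
```

For image-to-text retrieval the same function applies with the roles of queries and candidates swapped.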

Zero-shot image classification

| Dataset | OpenCLIP ViT-B/32 XLM-Roberta Base (Baseline) | CAPIVARA |
|---|---|---|
| Caltech-101 | 84.53 ± 0.00 | 82.97 ± 0.03 |
| CIFAR-10 | 93.99 ± 0.00 | 93.85 ± 0.00 |
| CIFAR-100 | 68.44 ± 0.00 | 69.37 ± 0.01 |
| Country-211 | 17.82 ± 0.00 | 17.61 ± 0.00 |
| DTD | 41.17 ± 0.00 | 42.34 ± 0.04 |
| EuroSAT | 47.16 ± 0.00 | 47.77 ± 0.02 |
| FER-2013 | 48.65 ± 0.00 | 46.68 ± 0.05 |
| FGVC-Aircraft | 26.30 ± 0.00 | 25.49 ± 0.01 |
| Food-101 | 65.06 ± 0.00 | 64.58 ± 0.01 |
| GTSRB | 43.27 ± 0.00 | 46.34 ± 0.01 |
| Hateful-Memes | 56.50 ± 0.00 | 56.17 ± 0.00 |
| KITTI-Distance | 28.41 ± 0.00 | 33.94 ± 0.13 |
| MNIST | 54.99 ± 0.00 | 60.14 ± 0.04 |
| Oxford Flowers-102 | 50.88 ± 0.00 | 49.93 ± 0.02 |
| Oxford-IIIT Pets | 81.56 ± 0.00 | 79.37 ± 0.00 |
| PatchCamelyon | 50.96 ± 0.00 | 51.71 ± 0.01 |
| Rendered-SST2 | 54.20 ± 0.00 | 54.82 ± 0.03 |
| RESISC-45 | 58.51 ± 0.00 | 59.71 ± 0.01 |
| Stanford-Cars | 84.93 ± 0.00 | 85.10 ± 0.02 |
| PASCAL VOC-2007 | 82.09 ± 0.00 | 82.29 ± 0.00 |
| Average | 56.97 ± 0.00 | 57.51 ± 0.02 |
| ImageNet-1k | 45.84 ± 0.00 | 46.06 ± 0.01 |
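
The card does not say how the Average entry is computed. The numbers are consistent with an unweighted mean over the 20 datasets from Caltech-101 through PASCAL VOC-2007, with ImageNet-1k reported separately; this is an inference from the values, not a statement from the authors. The short check below recomputes it from the reported means.

```python
from statistics import mean

# Reported mean scores, in the order Caltech-101 ... PASCAL VOC-2007 (ImageNet-1k excluded).
baseline = [84.53, 93.99, 68.44, 17.82, 41.17, 47.16, 48.65, 26.30, 65.06, 43.27,
            56.50, 28.41, 54.99, 50.88, 81.56, 50.96, 54.20, 58.51, 84.93, 82.09]
capivara = [82.97, 93.85, 69.37, 17.61, 42.34, 47.77, 46.68, 25.49, 64.58, 46.34,
            56.17, 33.94, 60.14, 49.93, 79.37, 51.71, 54.82, 59.71, 85.10, 82.29]

# Unweighted mean across datasets, rounded to two decimals like the reported values.
baseline_avg = round(mean(baseline), 2)
capivara_avg = round(mean(capivara), 2)
print(baseline_avg, capivara_avg)
```

Both results match the reported Average values (56.97 for the baseline and 57.51 for CAPIVARA).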

Environmental Impact

  • GPU: 1 x Quadro RTX 8000 (48 GB)
  • Hours used: 31 hours
  • Compute Region: Brazil
  • Carbon footprint: 0.5 kg
  • Energy: 6.49 kWh
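
As a rough sanity check on these figures, the sketch below derives the average power draw and the implied carbon intensity. It assumes the energy figure is a total in kilowatt-hours and the carbon footprint is in kilograms; the variable names are illustrative.

```python
hours_used = 31.0      # from the list above
energy_kwh = 6.49      # assumed to be total energy in kWh
carbon_kg = 0.5        # assumed to be kilograms

# Average power over the run: energy divided by time.
avg_power_kw = energy_kwh / hours_used
# Implied carbon intensity: emissions per unit of energy.
carbon_intensity = carbon_kg / energy_kwh

print(f"average power: {avg_power_kw:.3f} kW")
print(f"implied carbon intensity: {carbon_intensity:.3f} kg per kWh")
```

Under these assumptions the run averaged about 0.21 kW, which is plausible for a single workstation GPU.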

Citation

@inproceedings{santos2023capivara,
  title={CAPIVARA: Cost-Efficient Approach for Improving Multilingual CLIP Performance on Low-Resource Languages},
  author={Santos, Gabriel O. dos and Moreira, Diego A. B. and Ferreira, Alef I. and Silva, Jhessica and Pereira, Luiz and Bueno, Pedro and Sousa, Thiago and Maia, Helena and da Silva, N{\'a}dia and Colombini, Esther and Pedrini, Helio and Avila, Sandra},
  booktitle={Workshop on Multi-lingual Representation Learning (MRL), Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2023}
}