Model Card for CAPIVARA

CAPIVARA is a cost-efficient framework designed to enhance the performance of multilingual CLIP models in low-resource languages. The model achieves state-of-the-art results on many zero-shot tasks involving images and Portuguese text.

How to use

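import open_clip

# Load the CAPIVARA weights and the matching image transforms from the Hugging Face Hub.
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:hiaac-nlp/CAPIVARA')
# Load the tokenizer that matches the model's text encoder.
tokenizer = open_clip.get_tokenizer('hf-hub:hiaac-nlp/CAPIVARA')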

For more details, refer to the GitHub repository.
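The loading code above stops before inference. A minimal zero-shot classification sketch, following the usual open_clip inference pattern, might look like the following. The Portuguese prompt template, the helper names `build_prompts` and `classify`, and the fixed logit scale of 100 are illustrative assumptions, not values taken from the CAPIVARA authors.

```python
def build_prompts(labels, template="uma foto de {}"):
    """Turn class names into Portuguese prompts (the template is an assumption)."""
    return [template.format(label) for label in labels]


def classify(image_path, labels):
    """Return one probability per label for a single image (hypothetical helper)."""
    # Heavy imports are kept local so build_prompts works without them installed.
    import torch
    import open_clip
    from PIL import Image

    model, _, preprocess_val = open_clip.create_model_and_transforms("hf-hub:hiaac-nlp/CAPIVARA")
    tokenizer = open_clip.get_tokenizer("hf-hub:hiaac-nlp/CAPIVARA")
    model.eval()

    # Preprocess one image and add a batch dimension.
    image = preprocess_val(Image.open(image_path).convert("RGB")).unsqueeze(0)
    text = tokenizer(build_prompts(labels))

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        # L2-normalize so the dot product is a cosine similarity.
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        # 100.0 is the scale used in the open_clip README example (assumed here).
        probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    return probs[0].tolist()
```

Calling `classify("imagem.jpg", ["gato", "cachorro", "capivara"])` with a local image path (hypothetical here) would return three probabilities that sum to one. The first call downloads the weights from the Hugging Face Hub.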

Model Details

Model Description

CAPIVARA is built upon the pre-trained OpenCLIP ViT-B/32 XLM-Roberta Base model and fine-tuned with Conceptual Captions and synthetic captions generated by BLIP2. All captions were translated with Google Translator.

Uses

Direct Use

Zero-shot image classification, zero-shot image and text retrieval, etc.

Downstream Use

Image classification and other image task fine-tuning, linear probe image classification, image captioning, image generation guiding and conditioning, etc.

Ethical considerations

For ethical considerations, please refer to the Model Cards section in the paper.

Training Details

Training Data

The model was fine-tuned with Conceptual Captions and synthetic captions generated by BLIP2. All the captions are translated with Google Translator.

Training Hyperparameters

Optimizer: "Adam"
eps: 1e-8
weight_decay: 0.2
betas: [ 0.9, 0.98 ]

LR_scheduler: "CosineWarmupLR"
min_learning_rate: 1e-7
max_learning_rate: 5e-7
warmup_lr: 500

batch_size: 2816
max_steps: 5863 # 10 epochs
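The exact behaviour of "CosineWarmupLR" is defined in the project's training code and is not reproduced here. As a rough sketch of what such a schedule typically does with the values above, the following assumes that `warmup_lr: 500` means 500 warmup steps, that warmup is linear from the minimum to the maximum learning rate, and that a cosine decay back to the minimum follows.

```python
import math

# Values copied from the hyperparameter list above.
MIN_LR = 1e-7
MAX_LR = 5e-7
WARMUP_STEPS = 500   # assumption: "warmup_lr: 500" is a step count
MAX_STEPS = 5863


def learning_rate(step):
    """Linear warmup followed by cosine decay (assumed shape, not the authors' code)."""
    if step < WARMUP_STEPS:
        # Ramp linearly from MIN_LR to MAX_LR.
        return MIN_LR + (MAX_LR - MIN_LR) * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clipped to 1.0 past MAX_STEPS.
    progress = min((step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS), 1.0)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 250, 500, 3000, 5863):
        print(step, learning_rate(step))
```

Under these assumptions the rate starts at the minimum, peaks at step 500, and returns to the minimum at the final step.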

Evaluation

Testing Data, Factors & Metrics

Testing Data

For cross-modal retrieval, we used PraCegoVer, which is composed of images originally annotated with Portuguese texts, and our Portuguese-translated versions of MS COCO and Flickr30k. For image classification, we also translated the labels from ImageNet and the ELEVATER benchmark datasets.

Results

Zero-shot Cross-Modal Retrieval

We conducted zero-shot cross-modal retrieval experiments on Flickr30k and MS COCO with captions translated into Portuguese, and PraCegoVer. We report the average and standard deviation for 3 runs.

Models	Flickr30k text-to-image	Flickr30k image-to-text	MS COCO text-to-image	MS COCO image-to-text	PraCegoVer text-to-image	PraCegoVer image-to-text
OpenCLIP ViT-B/32 XLM-Roberta Base (Baseline)	76.23	87.93	52.62	66.55	65.36	69.43
CAPIVARA	79.56 ± 0.01	89.95 ± 0.04	56.27 ± 0.01	71.24 ± 0.01	66.40 ± 0.01	64.75 ± 0.01
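The "average ± standard deviation" entries summarize three runs per setting. A small sketch of that aggregation is shown below. The per-run scores are hypothetical placeholders, and the use of the population standard deviation is an assumption, since the card does not say which variant was used.

```python
import statistics


def summarize(scores):
    """Return (mean, standard deviation) of per-run scores, rounded to 2 decimals."""
    mean = statistics.mean(scores)
    # Population standard deviation (assumption); statistics.stdev gives the sample variant.
    std = statistics.pstdev(scores)
    return round(mean, 2), round(std, 2)


if __name__ == "__main__":
    hypothetical_runs = [70.0, 71.0, 72.0]  # placeholder scores, not real results
    mean, std = summarize(hypothetical_runs)
    print(f"{mean:.2f} ± {std:.2f}")
```

The baseline row has no deviation in this table because it is a single pre-trained checkpoint evaluated as-is.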

Zero-shot image classification

Models	Caltech-101	CIFAR-10	CIFAR-100	Country-211	DTD	EuroSAT	FER-2013	FGVC-Aircraft	Food-101	GTSRB	Hateful-Memes	KITTI-Distance	MNIST	Oxford Flowers-102	Oxford-IIIT Pets	PatchCamelyon	Rendered-SST2	RESISC-45	Stanford-Cars	PASCAL VOC-2007	Average	ImageNet-1k
OpenCLIP ViT-B/32 XLM-Roberta Base (Baseline)	84.53 ± 0.00	93.99 ± 0.00	68.44 ± 0.00	17.82 ± 0.00	41.17 ± 0.00	47.16 ± 0.00	48.65 ± 0.00	26.30 ± 0.00	65.06 ± 0.00	43.27 ± 0.00	56.50 ± 0.00	28.41 ± 0.00	54.99 ± 0.00	50.88 ± 0.00	81.56 ± 0.00	50.96 ± 0.00	54.20 ± 0.00	58.51 ± 0.00	84.93 ± 0.00	82.09 ± 0.00	56.97 ± 0.00	45.84 ± 0.00
CAPIVARA	82.97 ± 0.03	93.85 ± 0.00	69.37 ± 0.01	17.61 ± 0.00	42.34 ± 0.04	47.77 ± 0.02	46.68 ± 0.05	25.49 ± 0.01	64.58 ± 0.01	46.34 ± 0.01	56.17 ± 0.00	33.94 ± 0.13	60.14 ± 0.04	49.93 ± 0.02	79.37 ± 0.00	51.71 ± 0.01	54.82 ± 0.03	59.71 ± 0.01	85.10 ± 0.02	82.29 ± 0.00	57.51 ± 0.02	46.06 ± 0.01
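As a quick check on how to read the table, the "Average" column appears to be the mean over the 20 datasets that precede it, with ImageNet-1k reported separately. The sketch below recomputes it from the mean values in the two rows and counts the datasets on which CAPIVARA scores higher than the baseline. The variable names are illustrative.

```python
# Mean scores in table column order, Caltech-101 through PASCAL VOC-2007
# (the "Average" and "ImageNet-1k" columns are excluded).
baseline = [84.53, 93.99, 68.44, 17.82, 41.17, 47.16, 48.65, 26.30, 65.06, 43.27,
            56.50, 28.41, 54.99, 50.88, 81.56, 50.96, 54.20, 58.51, 84.93, 82.09]
capivara = [82.97, 93.85, 69.37, 17.61, 42.34, 47.77, 46.68, 25.49, 64.58, 46.34,
            56.17, 33.94, 60.14, 49.93, 79.37, 51.71, 54.82, 59.71, 85.10, 82.29]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


# Number of datasets where CAPIVARA's mean score exceeds the baseline's.
wins = sum(c > b for b, c in zip(baseline, capivara))

if __name__ == "__main__":
    print("baseline average:", round(mean(baseline), 2))
    print("CAPIVARA average:", round(mean(capivara), 2))
    print("datasets improved:", wins, "of", len(baseline))
```

The recomputed averages match the table (56.97 and 57.51), and CAPIVARA is ahead on 11 of the 20 datasets. The largest gains are on KITTI-Distance and MNIST.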

Environmental Impact

GPU: 1 x Quadro RTX 8000 (48 GB)
Hours used: 31 hours
Compute Region: Brazil
Carbon footprint: 0.5 kg
Energy: 6.49 kWh
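These figures can be sanity-checked with simple arithmetic. The sketch below assumes the energy figure is a total in kilowatt-hours and that the carbon footprint is in kilograms of CO2-equivalent. Neither unit is spelled out in the card, so the derived numbers are indicative only.

```python
# Figures copied from the list above.
HOURS_USED = 31
ENERGY_KWH = 6.49   # assumption: total energy in kilowatt-hours
CARBON_KG = 0.5     # assumption: kilograms of CO2-equivalent

# Average power draw over the whole training run.
average_power_kw = ENERGY_KWH / HOURS_USED

# Implied carbon intensity of the electricity used.
carbon_intensity = CARBON_KG / ENERGY_KWH

if __name__ == "__main__":
    print(f"average power: {average_power_kw:.3f} kW")
    print(f"carbon intensity: {carbon_intensity:.3f} kg per kWh")
```

An average draw of roughly 0.21 kW is plausible for a single workstation GPU, which supports reading the energy figure as a total rather than as a power rating.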

Citation

@inproceedings{santos2023capivara,
  title={CAPIVARA: Cost-Efficient Approach for Improving Multilingual CLIP Performance on Low-Resource Languages},
  author={Santos, Gabriel O. dos and Moreira, Diego A. B. and Ferreira, Alef I. and Silva, Jhessica and Pereira, Luiz and Bueno, Pedro and Sousa, Thiago and Maia, Helena and da Silva, N{\'a}dia and Colombini, Esther and Pedrini, Helio and Avila, Sandra},
  booktitle = "Workshop on Multi-lingual Representation Learning (MRL), Conference on Empirical Methods in Natural Language Processing (EMNLP)",
  year = "2023"
}
