---
license: mit
language:
- pt
tags:
- CAPIVARA
- Portuguese CLIP
- OpenCLIP
datasets:
- conceptual_captions
- PraCegoVer
- MS_COCO
- Flickr30K
- ImageNet
- ELEVATER
---

# Model Card for CAPIVARA

CAPIVARA is a cost-efficient framework designed to enhance the performance of multilingual CLIP models in low-resource languages.
It achieves state-of-the-art results on several zero-shot tasks involving images and Portuguese text.

## How to use

```python
import open_clip

model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:hiaac-nlp/CAPIVARA')
tokenizer = open_clip.get_tokenizer('hf-hub:hiaac-nlp/CAPIVARA')
```

For more details, refer to the [GitHub repo](https://github.com/hiaac-nlp/CAPIVARA/).
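As a concrete usage sketch, the snippet below runs zero-shot classification with the loaded model, following the standard OpenCLIP inference pattern. The image path and the Portuguese prompts are illustrative placeholders, not part of the official documentation.

```python
import torch
from PIL import Image
import open_clip

model, _, preprocess_val = open_clip.create_model_and_transforms('hf-hub:hiaac-nlp/CAPIVARA')
tokenizer = open_clip.get_tokenizer('hf-hub:hiaac-nlp/CAPIVARA')
model.eval()

# Illustrative inputs: any RGB image and a set of candidate captions in Portuguese.
image = preprocess_val(Image.open("example.jpg")).unsqueeze(0)
labels = ["uma capivara", "um cachorro", "um gato"]
text = tokenizer(labels)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between L2-normalized embeddings, softmaxed into class probabilities.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs.squeeze(0).tolist())))
```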
## Model Details

### Model Description

CAPIVARA is built upon the pre-trained [OpenCLIP ViT-B/32 XLM-Roberta Base](https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k) and
fine-tuned with [Conceptual Captions](https://aclanthology.org/P18-1238.pdf) and synthetic captions generated by [BLIP2](https://huggingface.co/Salesforce/blip2-opt-2.7b-coco).
All captions were translated into Portuguese with Google Translate.
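To illustrate the synthetic-caption step, the sketch below generates an English caption with BLIP2 through 🤗 Transformers; such captions would then be machine-translated into Portuguese. The image path and generation settings are assumptions, not the exact pipeline used for CAPIVARA.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b-coco")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b-coco", torch_dtype=torch.float16
).to("cuda")

# Illustrative input image; in practice this is applied to the training images.
image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)

generated_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)  # English caption, later translated to Portuguese
```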
## Uses

### Direct Use

Zero-shot image classification, zero-shot image and text retrieval, and similar tasks.

### Downstream Use

Fine-tuning for image classification and other image tasks, linear-probe image classification,
image captioning, guiding and conditioning image generation, and similar tasks.
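As one example of the downstream uses listed above, the sketch below trains a linear probe on frozen CAPIVARA image features with scikit-learn. The dataset inputs are placeholders supplied by the caller; this is not an official training script.

```python
import torch
import open_clip
from sklearn.linear_model import LogisticRegression

model, _, preprocess_val = open_clip.create_model_and_transforms('hf-hub:hiaac-nlp/CAPIVARA')
model.eval()

def extract_features(images):
    """Encode a list of PIL images into L2-normalized CAPIVARA embeddings."""
    batch = torch.stack([preprocess_val(img) for img in images])
    with torch.no_grad():
        feats = model.encode_image(batch)
        feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.cpu().numpy()

def linear_probe(train_images, train_labels, test_images, test_labels):
    """Fit a logistic-regression probe on frozen image features and return test accuracy."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(extract_features(train_images), train_labels)
    return clf.score(extract_features(test_images), test_labels)
```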
## Ethical considerations

For ethical considerations, please refer to the Model Cards section in the [paper](https://arxiv.org/abs/2310.13683).
## Training Details

### Training Data

The model was fine-tuned with [Conceptual Captions](https://aclanthology.org/P18-1238.pdf) and synthetic captions generated by [BLIP2](https://huggingface.co/Salesforce/blip2-opt-2.7b-coco).
All captions were translated into Portuguese with Google Translate.
#### Training Hyperparameters

```
Optimizer: "Adam"
eps: 1e-8
weight_decay: 0.2
betas: [ 0.9, 0.98 ]

LR_scheduler: "CosineWarmupLR"
min_learning_rate: 1e-7
max_learning_rate: 5e-7
warmup_lr: 500

batch_size: 2816
max_steps: 5863  # 10 epochs
```
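A minimal sketch of how these settings could be wired up in PyTorch, assuming `warmup_lr: 500` denotes 500 warmup steps; the actual training code lives in the GitHub repo.

```python
import math
import torch

def build_optimizer_and_scheduler(model, max_steps=5863, warmup_steps=500,
                                  min_lr=1e-7, max_lr=5e-7):
    """Adam plus linear warmup followed by cosine decay, mirroring the config above."""
    optimizer = torch.optim.Adam(
        model.parameters(), lr=max_lr,
        betas=(0.9, 0.98), eps=1e-8, weight_decay=0.2,
    )

    def lr_lambda(step):
        # Warm up linearly to max_lr, then decay along a cosine towards min_lr.
        if step < warmup_steps:
            return (step + 1) / warmup_steps
        progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return (min_lr + (max_lr - min_lr) * cosine) / max_lr

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```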
## Evaluation

+ [Zero-shot image classification](https://github.com/hiaac-nlp/CAPIVARA/blob/main/clip_pt/src/evaluate/capivara_classification.ipynb)
+ [Zero-shot cross-modal retrieval](https://github.com/hiaac-nlp/CAPIVARA/blob/main/clip_pt/src/evaluate/capivara_retrieval.ipynb)
### Testing Data, Factors & Metrics

#### Testing Data

For cross-modal retrieval, we used [PraCegoVer](https://www.mdpi.com/2306-5729/7/2/13), whose images are originally annotated
with Portuguese texts, and our Portuguese-translated versions of [MS COCO](https://link.springer.com/chapter/10.1007/978-3-319-10602-1_48)
and [Flickr30k](https://openaccess.thecvf.com/content_iccv_2015/papers/Plummer_Flickr30k_Entities_Collecting_ICCV_2015_paper.pdf).
We also translated the labels of [ImageNet](https://ieeexplore.ieee.org/document/5206848) and of the
[ELEVATER](https://proceedings.neurips.cc/paper_files/paper/2022/hash/3c4688b6a76f25f2311daa0d75a58f1a-Abstract-Datasets_and_Benchmarks.html)
benchmark datasets for image classification.
### Results

#### Zero-shot Cross-Modal Retrieval

We conducted zero-shot cross-modal retrieval experiments on Flickr30k and MS COCO, with captions
translated into Portuguese, and on PraCegoVer. We report the mean and standard deviation over 3 runs.
<table>
<thead>
  <tr>
    <th>Models</th>
    <th colspan="2">Flickr30k</th>
    <th colspan="2">MS COCO</th>
    <th colspan="2">PraCegoVer</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td></td>
    <td>text-to-image</td>
    <td>image-to-text</td>
    <td>text-to-image</td>
    <td>image-to-text</td>
    <td>text-to-image</td>
    <td>image-to-text</td>
  </tr>
  <tr>
    <td>OpenCLIP ViT-B/32 XLM-Roberta Base (Baseline)</td>
    <td>76.23</td>
    <td>87.93</td>
    <td>52.62</td>
    <td>66.55</td>
    <td>65.36</td>
    <td><b>69.43</b></td>
  </tr>
  <tr>
    <td>CAPIVARA</td>
    <td><b>79.56 ± 0.01</b></td>
    <td><b>89.95 ± 0.04</b></td>
    <td><b>56.27 ± 0.01</b></td>
    <td><b>71.24 ± 0.01</b></td>
    <td><b>66.40 ± 0.01</b></td>
    <td>64.75 ± 0.01</td>
  </tr>
</tbody>
</table>
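For reference, here is a minimal sketch of how text-to-image Recall@k could be computed from CAPIVARA embeddings. It assumes precomputed, L2-normalized feature matrices and a ground-truth image index per caption; it is not the evaluation code from the notebooks above.

```python
import torch

def recall_at_k(text_features, image_features, gt_image_idx, k=1):
    """Text-to-image Recall@k from L2-normalized feature tensors.

    text_features: (num_captions, dim), image_features: (num_images, dim),
    gt_image_idx: (num_captions,) index of the correct image for each caption.
    """
    sims = text_features @ image_features.T            # cosine similarities
    topk = sims.topk(k, dim=-1).indices                # top-k retrieved image indices per caption
    hits = (topk == gt_image_idx.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()
```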
#### Zero-shot image classification

| Models | **Caltech-101** | **CIFAR-10** | **CIFAR-100** | **Country-211** | **DTD** | **EuroSAT** | **FER-2013** | **FGVC-Aircraft** | **Food-101** | **GTSRB** | **Hateful-Memes** | **KITTI-Distance** | **MNIST** | **Oxford Flowers-102** | **Oxford-IIIT Pets** | **PatchCamelyon** | **Rendered-SST2** | **RESISC-45** | **Stanford-Cars** | **PASCAL VOC-2007** | **Average** | **ImageNet-1k** |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| OpenCLIP ViT-B/32 XLM-Roberta Base (Baseline) | 84.53 ± 0.00 | 93.99 ± 0.00 | 68.44 ± 0.00 | 17.82 ± 0.00 | 41.17 ± 0.00 | 47.16 ± 0.00 | 48.65 ± 0.00 | 26.30 ± 0.00 | 65.06 ± 0.00 | 43.27 ± 0.00 | 56.50 ± 0.00 | 28.41 ± 0.00 | 54.99 ± 0.00 | 50.88 ± 0.00 | 81.56 ± 0.00 | 50.96 ± 0.00 | 54.20 ± 0.00 | 58.51 ± 0.00 | 84.93 ± 0.00 | 82.09 ± 0.00 | 56.97 ± 0.00 | 45.84 ± 0.00 |
| CAPIVARA | 82.97 ± 0.03 | 93.85 ± 0.00 | 69.37 ± 0.01 | 17.61 ± 0.00 | 42.34 ± 0.04 | 47.77 ± 0.02 | 46.68 ± 0.05 | 25.49 ± 0.01 | 64.58 ± 0.01 | 46.34 ± 0.01 | 56.17 ± 0.00 | 33.94 ± 0.13 | 60.14 ± 0.04 | 49.93 ± 0.02 | 79.37 ± 0.00 | 51.71 ± 0.01 | 54.82 ± 0.03 | 59.71 ± 0.01 | 85.10 ± 0.02 | 82.29 ± 0.00 | **57.51 ± 0.02** | **46.06 ± 0.01** |
## Environmental Impact

- **GPU:** 1 x Quadro RTX 8000 (48 GB)
- **Hours used:** 31 hours
- **Compute Region:** Brazil
- **Carbon footprint:** 0.5 kg CO2eq
- **Energy consumed:** 6.49 kWh
## Citation

```bibtex
@inproceedings{santos2023capivara,
  title = {CAPIVARA: Cost-Efficient Approach for Improving Multilingual CLIP Performance on Low-Resource Languages},
  author = {Santos, Gabriel O. dos and Moreira, Diego A. B. and Ferreira, Alef I. and Silva, Jhessica and Pereira, Luiz and Bueno, Pedro and Sousa, Thiago and Maia, Helena and da Silva, N{\'a}dia and Colombini, Esther and Pedrini, Helio and Avila, Sandra},
  booktitle = {Workshop on Multi-lingual Representation Learning (MRL), Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year = {2023}
}
```