---
license: mit
language:
- pt
tags:
- CAPIVARA
- Portuguese CLIP
- OpenCLIP
datasets:
- conceptual_captions
- PraCegoVer
- MS_COCO
- Flickr30K
- ImageNet
- ELEVATER
---
# Model Card for CAPIVARA
CAPIVARA is a cost-efficient framework designed to enhance the performance of multilingual CLIP models in low-resource languages.
This model achieves state-of-the-art results on many zero-shot tasks involving images and Portuguese text.
## How to use
```python
import open_clip

# Load CAPIVARA and its image preprocessing transforms from the Hugging Face Hub.
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:hiaac-nlp/CAPIVARA')
tokenizer = open_clip.get_tokenizer('hf-hub:hiaac-nlp/CAPIVARA')
```
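Once loaded, the model scores Portuguese text against images in the standard OpenCLIP way. Below is a minimal inference sketch; the image path and the captions are illustrative placeholders, not from the original card:

```python
import torch
from PIL import Image
import open_clip

model, _, preprocess_val = open_clip.create_model_and_transforms('hf-hub:hiaac-nlp/CAPIVARA')
tokenizer = open_clip.get_tokenizer('hf-hub:hiaac-nlp/CAPIVARA')
model.eval()

# "capivara.jpg" and the captions below are illustrative placeholders.
image = preprocess_val(Image.open("capivara.jpg")).unsqueeze(0)
texts = tokenizer([
    "uma capivara na grama",     # "a capybara on the grass"
    "um gato dormindo no sofá",  # "a cat sleeping on the couch"
    "uma praia ao pôr do sol",   # "a beach at sunset"
])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    # Cosine similarity between the image and each caption, as probabilities.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # the highest value should correspond to the best-matching caption
```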
For more details, refer to the [GitHub repo](https://github.com/hiaac-nlp/CAPIVARA/).
## Model Details
### Model Description
CAPIVARA is built upon the pre-trained [OpenCLIP ViT-B/32 XLM-Roberta Base](https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k) and
fine-tuned with [Conceptual Captions](https://aclanthology.org/P18-1238.pdf) and synthetic captions generated by [BLIP2](https://huggingface.co/Salesforce/blip2-opt-2.7b-coco).
All captions were translated into Portuguese with Google Translate.
## Uses
### Direct Use
Zero-shot image classification, zero-shot image and text retrieval, etc.
### Downstream Use
Fine-tuning for image classification and other image tasks, linear-probe image classification,
image captioning, guiding and conditioning image generation, etc.
## Ethical considerations
For ethical considerations, please refer to the Model Cards section of the [paper](https://arxiv.org/abs/2310.13683).
## Training Details
### Training Data
The model was fine-tuned with [Conceptual Captions](https://aclanthology.org/P18-1238.pdf) and synthetic captions generated by [BLIP2](https://huggingface.co/Salesforce/blip2-opt-2.7b-coco).
All captions were translated into Portuguese with Google Translate.
#### Training Hyperparameters
```
Optimizer: "Adam"
eps: 1e-8
weight_decay: 0.2
betas: [ 0.9, 0.98 ]
LR_scheduler: "CosineWarmupLR"
min_learning_rate: 1e-7
max_learning_rate: 5e-7
warmup_lr: 500
batch_size: 2816
max_steps: 5863 # 10 epochs
```
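For illustration, here is a minimal sketch of how these values might be wired into PyTorch. `CosineWarmupLR` is not a built-in torch scheduler, so a hand-rolled cosine schedule with linear warmup is used, under the assumption that `warmup_lr: 500` denotes 500 warmup steps:

```python
import math
import torch

def make_optimizer_and_scheduler(params):
    # Values taken from the hyperparameter listing above.
    max_lr, min_lr = 5e-7, 1e-7
    warmup_steps, max_steps = 500, 5863

    optimizer = torch.optim.Adam(
        params, lr=max_lr, betas=(0.9, 0.98), eps=1e-8, weight_decay=0.2
    )

    def cosine_warmup(step):
        # Linear warmup to max_lr, then cosine decay toward min_lr.
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
        cosine = 0.5 * (1 + math.cos(math.pi * progress))
        return (min_lr + (max_lr - min_lr) * cosine) / max_lr

    # Call scheduler.step() once per training step, not per epoch.
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=cosine_warmup)
    return optimizer, scheduler
```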
## Evaluation
+ [Zero-shot image classification](https://github.com/hiaac-nlp/CAPIVARA/blob/main/clip_pt/src/evaluate/capivara_classification.ipynb)
+ [Zero-shot cross-modal retrieval](https://github.com/hiaac-nlp/CAPIVARA/blob/main/clip_pt/src/evaluate/capivara_retrieval.ipynb)
### Testing Data, Factors & Metrics
#### Testing Data
For cross-modal retrieval, we used [PraCegoVer](https://www.mdpi.com/2306-5729/7/2/13), which is composed of images originally annotated
with Portuguese texts, and our Portuguese-translated versions of [MS COCO](https://link.springer.com/chapter/10.1007/978-3-319-10602-1_48)
and [Flickr30k](https://openaccess.thecvf.com/content_iccv_2015/papers/Plummer_Flickr30k_Entities_Collecting_ICCV_2015_paper.pdf).
For image classification, we also translated the labels from [ImageNet](https://ieeexplore.ieee.org/document/5206848) and the
[ELEVATER](https://proceedings.neurips.cc/paper_files/paper/2022/hash/3c4688b6a76f25f2311daa0d75a58f1a-Abstract-Datasets_and_Benchmarks.html)
benchmark datasets into Portuguese.
### Results
#### Zero-shot Cross-Modal Retrieval
We conducted zero-shot cross-modal retrieval experiments on PraCegoVer and on Flickr30k and MS COCO with captions
translated into Portuguese. We report the mean and standard deviation over 3 runs; a generic sketch of the recall computation is given after the table.
<table>
<thead>
<tr>
<th>Models</th>
<th colspan="2">Flickr30k</th>
<th colspan="2"> MS COCO</th>
<th colspan="2">PraCegoVer</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>text-to-image</td>
<td> image-to-text</td>
<td>text-to-image</td>
<td> image-to-text</td>
<td>text-to-image</td>
<td> image-to-text</td>
</tr>
<tr>
<td>OpenCLIP ViT-B/32 XLM-Roberta Base (Baseline)</td>
<td>76.23</td>
<td>87.93</td>
<td>52.62</td>
<td>66.55</td>
<td>65.36</td>
<td><b>69.43</b></td>
</tr>
<tr>
<td>CAPIVARA</td>
<td><b>79.56 ± 0.01</b></td>
<td><b>89.95 ± 0.04</b></td>
<td><b>56.27 ± 0.01</b></td>
<td><b>71.24 ± 0.01</b></td>
<td><b>66.40 ± 0.01</b></td>
<td>64.75 ± 0.01</td>
</tr>
</tbody>
</table>
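To make the retrieval protocol concrete, the sketch below computes a recall@k-style metric from precomputed, L2-normalized embeddings. It assumes a simplified one-to-one caption-image pairing (Flickr30k and MS COCO actually have multiple captions per image) and is not the paper's evaluation code:

```python
import torch

def recall_at_k(text_feats, image_feats, k=1):
    """text_feats[i] and image_feats[i] form a matching (caption, image) pair,
    both L2-normalized. Returns the fraction of captions whose true image
    appears among the top-k retrieved images (text-to-image direction)."""
    sims = text_feats @ image_feats.T             # (n_texts, n_images) similarity matrix
    topk = sims.topk(k, dim=-1).indices           # indices of the top-k images per caption
    targets = torch.arange(text_feats.size(0)).unsqueeze(1)
    return (topk == targets).any(dim=-1).float().mean().item()
```

Swapping the roles of `text_feats` and `image_feats` gives the image-to-text direction.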
#### Zero-shot Image Classification
| Models | **Caltech-101** | **CIFAR-10** | **CIFAR-100** | **Country-211** | **DTD** | **EuroSAT** | **FER-2013** | **FGVC-Aircraft** | **Food-101** | **GTSRB** | **Hateful-Memes** | **KITTI-Distance** | **MNIST** | **Oxford Flowers-102** | **Oxford-IIIT Pets** | **PatchCamelyon** | **Rendered-SST2** | **RESISC-45** | **Stanford-Cars** | **PASCAL VOC-2007** | **Average** | **ImageNet-1k** |
|:-----------------------:|:---------------:|:------------:|:-------------:|:---------------:|:------------:|:------------:|:------------:|:-----------------:|:------------:|:------------:|:-----------------:|:------------------:|:------------:|:----------------------:|:--------------------:|:-----------------:|:-----------------:|:-------------:|:-----------------:|:-------------------:|:------------:|:---------------:|
| OpenCLIP ViT-B/32 XLM-Roberta Base (Baseline) | 84.53 ± 0.00 | 93.99 ± 0.00 | 68.44 ± 0.00 | 17.82 ± 0.00 | 41.17 ± 0.00 | 47.16 ± 0.00 | 48.65 ± 0.00 | 26.30 ± 0.00 | 65.06 ± 0.00 | 43.27 ± 0.00 | 56.50 ± 0.00 | 28.41 ± 0.00 | 54.99 ± 0.00 | 50.88 ± 0.00 | 81.56 ± 0.00 | 50.96 ± 0.00 | 54.20 ± 0.00 | 58.51 ± 0.00 | 84.93 ± 0.00 | 82.09 ± 0.00 | 56.97 ± 0.00 | 45.84 ± 0.00 |
| CAPIVARA | 82.97 ± 0.03 | 93.85 ± 0.00 | 69.37 ± 0.01 | 17.61 ± 0.00 | 42.34 ± 0.04 | 47.77 ± 0.02 | 46.68 ± 0.05 | 25.49 ± 0.01 | 64.58 ± 0.01 | 46.34 ± 0.01 | 56.17 ± 0.00 | 33.94 ± 0.13 | 60.14 ± 0.04 | 49.93 ± 0.02 | 79.37 ± 0.00 | 51.71 ± 0.01 | 54.82 ± 0.03 | 59.71 ± 0.01 | 85.10 ± 0.02 | 82.29 ± 0.00 | **57.51 ± 0.02** | **46.06 ± 0.01** |
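As a usage illustration for classification-style tasks, a zero-shot classifier can be built from Portuguese class names in the usual CLIP fashion. The prompt template and class names below are illustrative, not the prompts used in the paper:

```python
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:hiaac-nlp/CAPIVARA')
tokenizer = open_clip.get_tokenizer('hf-hub:hiaac-nlp/CAPIVARA')
model.eval()

# Illustrative Portuguese class names and prompt template.
class_names = ["cachorro", "gato", "pássaro"]
prompts = [f"uma foto de um(a) {name}" for name in class_names]

with torch.no_grad():
    # One classifier weight vector per class, from its prompted text embedding.
    class_weights = model.encode_text(tokenizer(prompts))
    class_weights /= class_weights.norm(dim=-1, keepdim=True)

def predict(image_batch):
    """Return the predicted class index for a batch of preprocessed images."""
    with torch.no_grad():
        feats = model.encode_image(image_batch)
        feats /= feats.norm(dim=-1, keepdim=True)
        return (feats @ class_weights.T).argmax(dim=-1)
```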
## Environmental Impact
- **GPU:** 1 x Quadro RTX 8000 (48 GB)
- **Hours used:** 31
- **Compute Region:** Brazil
- **Carbon footprint:** 0.5 kg
- **Energy consumed:** 6.49 kWh
## Citation
```bibtex
@inproceedings{santos2023capivara,
  title     = {CAPIVARA: Cost-Efficient Approach for Improving Multilingual CLIP Performance on Low-Resource Languages},
  author    = {Santos, Gabriel O. dos and Moreira, Diego A. B. and Ferreira, Alef I. and Silva, Jhessica and Pereira, Luiz and Bueno, Pedro and Sousa, Thiago and Maia, Helena and da Silva, N{\'a}dia and Colombini, Esther and Pedrini, Helio and Avila, Sandra},
  booktitle = {Workshop on Multi-lingual Representation Learning (MRL), Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2023}
}
```