---
license: mit
language:
- pt
tags:
- CAPIVARA
- Portuguese CLIP
- OpenCLIP
datasets:
- conceptual_captions
- PraCegoVer
- MS_COCO
- Flickr30K
- ImageNet
- ELEVATER
---
# Model Card for CAPIVARA

CAPIVARA is a cost-efficient framework designed to enhance the performance of multilingual CLIP models in low-resource languages. 
This model achieves state-of-the-art results on many zero-shot tasks involving images and Portuguese text.

## How to use
```python
import open_clip

# Load the CAPIVARA model and the train/validation image transforms from the Hugging Face Hub
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:hiaac-nlp/CAPIVARA')

# Matching text tokenizer
tokenizer = open_clip.get_tokenizer('hf-hub:hiaac-nlp/CAPIVARA')
```
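
A minimal zero-shot classification sketch using the loaded model and tokenizer is shown below. The image path and the Portuguese candidate labels are placeholders; the snippet only assumes the standard OpenCLIP `encode_image`/`encode_text` API.

```python
import torch
from PIL import Image

# Placeholder image and Portuguese candidate labels (replace with your own data)
image = preprocess_val(Image.open("example.jpg")).unsqueeze(0)
texts = tokenizer(["uma foto de um cachorro", "uma foto de um gato", "uma foto de uma capivara"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)

    # Cosine similarity between the image and each candidate caption
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # probability assigned to each candidate label
```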

For more details, refer to the [GitHub repository](https://github.com/hiaac-nlp/CAPIVARA/).


## Model Details

### Model Description

CAPIVARA is built upon the pre-trained [OpenCLIP ViT-B/32 XLM-Roberta Base](https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k) and 
fine-tuned with [Conceptual Captions](https://aclanthology.org/P18-1238.pdf) and synthetic captions generated by [BLIP2](https://huggingface.co/Salesforce/blip2-opt-2.7b-coco).
All captions are translated into Portuguese with Google Translate.


## Uses

### Direct Use

Zero-shot image classification, zero-shot cross-modal (image-text) retrieval, etc.

### Downstream Use

Fine-tuning for image classification and other image tasks, linear-probe image classification, 
image captioning, image generation guiding and conditioning, etc. A sketch of the linear-probe case follows below.
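
As an illustration of the linear-probe use case, the sketch below fits a logistic-regression classifier on frozen CAPIVARA image features. The `train_images`/`train_labels` and `test_images`/`test_labels` variables are hypothetical placeholders for your own dataset split.

```python
import torch
from sklearn.linear_model import LogisticRegression

def extract_features(model, preprocess, images):
    """Encode a list of PIL images into L2-normalized CAPIVARA image features."""
    with torch.no_grad():
        batch = torch.stack([preprocess(img) for img in images])
        feats = model.encode_image(batch)
        feats /= feats.norm(dim=-1, keepdim=True)
    return feats.cpu().numpy()

# `train_images`, `train_labels`, `test_images`, `test_labels` are placeholders
X_train = extract_features(model, preprocess_val, train_images)
X_test = extract_features(model, preprocess_val, test_images)

# Linear probe: a simple logistic regression on top of the frozen features
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train_labels)
print("Linear-probe accuracy:", clf.score(X_test, test_labels))
```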


## Ethical considerations

For ethical considerations, please refer to the Model Cards section in the [paper](https://arxiv.org/abs/2310.13683).

## Training Details

### Training Data
The model was fine-tuned with [Conceptual Captions](https://aclanthology.org/P18-1238.pdf) and synthetic captions generated by [BLIP2](https://huggingface.co/Salesforce/blip2-opt-2.7b-coco).
All captions are translated into Portuguese with Google Translate.

#### Training Hyperparameters
```
Optimizer: "Adam"
eps: 1e-8
weight_decay: 0.2
betas: [ 0.9, 0.98 ]

LR_scheduler: "CosineWarmupLR"
min_learning_rate: 1e-7
max_learning_rate: 5e-7
warmup_lr: 500

batch_size: 2816
max_steps: 5863 # 10 epochs
```
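
The sketch below shows one way these hyperparameters could map onto a standard PyTorch optimizer and learning-rate schedule. `CosineWarmupLR` is not a built-in PyTorch class, so the linear-warmup/cosine-decay scheduler here is an illustrative reimplementation; it also assumes `warmup_lr: 500` denotes the number of warmup steps, and `model.parameters()` stands in for the actual trainable parameters used in the repository.

```python
import math
import torch

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=5e-7,          # max_learning_rate
    eps=1e-8,
    weight_decay=0.2,
    betas=(0.9, 0.98),
)

warmup_steps, max_steps = 500, 5863
min_lr, max_lr = 1e-7, 5e-7

def lr_lambda(step):
    # Linear warmup followed by cosine decay toward min_learning_rate
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return (min_lr + (max_lr - min_lr) * cosine) / max_lr

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```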

## Evaluation

+ [Zero-shot image classification](https://github.com/hiaac-nlp/CAPIVARA/blob/main/clip_pt/src/evaluate/capivara_classification.ipynb)
+ [Zero-shot cross-modal retrieval](https://github.com/hiaac-nlp/CAPIVARA/blob/main/clip_pt/src/evaluate/capivara_retrieval.ipynb)


### Testing Data, Factors & Metrics

#### Testing Data

For cross-modal retrieval, we used [PraCegoVer](https://www.mdpi.com/2306-5729/7/2/13), which is composed of images originally annotated 
with Portuguese texts, and our Portuguese-translated versions of [MS COCO](https://link.springer.com/chapter/10.1007/978-3-319-10602-1_48) 
and [Flickr30k](https://openaccess.thecvf.com/content_iccv_2015/papers/Plummer_Flickr30k_Entities_Collecting_ICCV_2015_paper.pdf). 
We also translated the labels from [ImageNet](https://ieeexplore.ieee.org/document/5206848) and the 
[ELEVATER](https://proceedings.neurips.cc/paper_files/paper/2022/hash/3c4688b6a76f25f2311daa0d75a58f1a-Abstract-Datasets_and_Benchmarks.html) 
benchmark datasets for image classification.

### Results

#### Zero-shot Cross-Modal Retrieval

We conducted zero-shot cross-modal retrieval experiments on Flickr30k and MS COCO, with captions
translated into Portuguese, and on PraCegoVer. We report the average and standard deviation over 3 runs.

<table>
<thead>
  <tr>
    <th>Models</th>
    <th colspan="2">Flickr30k</th>
    <th colspan="2"> MS COCO</th>
    <th colspan="2">PraCegoVer</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td></td>
    <td>text-to-image</td>
    <td> image-to-text</td>
    <td>text-to-image</td>
    <td> image-to-text</td>
    <td>text-to-image</td>
    <td> image-to-text</td>
  </tr>
  <tr>
    <td>OpenCLIP ViT-B/32 XLM-Roberta Base (Baseline)</td>
    <td>76.23</td>
    <td>87.93</td>
    <td>52.62</td>
    <td>66.55</td>
    <td>65.36</td>
    <td><b>69.43</b></td>
  </tr>
  <tr>
    <td>CAPIVARA</td>
    <td><b>79.56 ± 0.01</b></td>
    <td><b>89.95 ± 0.04</b></td>
    <td><b>56.27 ± 0.01</b></td>
    <td><b>71.24 ± 0.01</b></td>
    <td><b>66.40 ± 0.01</b></td>
    <td>64.75 ± 0.01</td>
  </tr>
</tbody>
</table>

#### Zero-shot Image Classification

|            Models             | **Caltech-101** | **CIFAR-10** | **CIFAR-100** | **Country-211** | **DTD**      | **EuroSAT**  | **FER-2013** | **FGVC-Aircraft** | **Food-101** | **GTSRB**    | **Hateful-Memes** | **KITTI-Distance** | **MNIST**    | **Oxford Flowers-102** | **Oxford-IIIT Pets** | **PatchCamelyon** | **Rendered-SST2** | **RESISC-45** | **Stanford-Cars** | **PASCAL VOC-2007** | **Average**  | **ImageNet-1k** |
|:-----------------------:|:---------------:|:------------:|:-------------:|:---------------:|:------------:|:------------:|:------------:|:-----------------:|:------------:|:------------:|:-----------------:|:------------------:|:------------:|:----------------------:|:--------------------:|:-----------------:|:-----------------:|:-------------:|:-----------------:|:-------------------:|:------------:|:---------------:|
| OpenCLIP ViT-B/32 XLM-Roberta Base (Baseline) | 84.53 ± 0.00    | 93.99 ± 0.00 | 68.44 ± 0.00  | 17.82 ± 0.00    | 41.17 ± 0.00 | 47.16 ± 0.00 | 48.65 ± 0.00 | 26.30 ± 0.00      | 65.06 ± 0.00 | 43.27 ± 0.00 | 56.50 ± 0.00      | 28.41 ± 0.00       | 54.99 ± 0.00 | 50.88 ± 0.00           | 81.56 ± 0.00         | 50.96 ± 0.00      | 54.20 ± 0.00      | 58.51 ± 0.00  | 84.93 ± 0.00      | 82.09 ± 0.00        | 56.97 ± 0.00 | 45.84 ± 0.00    |
|      CAPIVARA       | 82.97 ± 0.03    | 93.85 ± 0.00 | 69.37 ± 0.01  | 17.61 ± 0.00    | 42.34 ± 0.04 | 47.77 ± 0.02 | 46.68 ± 0.05 | 25.49 ± 0.01      | 64.58 ± 0.01 | 46.34 ± 0.01 | 56.17 ± 0.00      | 33.94 ± 0.13       | 60.14 ± 0.04 | 49.93 ± 0.02           | 79.37 ± 0.00         | 51.71 ± 0.01      | 54.82 ± 0.03      | 59.71 ± 0.01  | 85.10 ± 0.02      | 82.29 ± 0.00        | **57.51 ± 0.02** | **46.06 ± 0.01**    |

## Environmental Impact

- **GPU:** 1 x Quadro RTX 8000 (48 GB)
- **Hours used:** 31 hours
- **Compute Region:** Brazil
- **Carbon footprint:** 0.5 kg CO₂eq
- **Energy:** 6.49 kWh


## Citation

```bibtex
@inproceedings{santos2023capivara,
  title={CAPIVARA: Cost-Efficient Approach for Improving Multilingual CLIP Performance on Low-Resource Languages},
  author={Santos, Gabriel O. dos and Moreira, Diego A. B. and Ferreira, Alef I. and Silva, Jhessica and Pereira, Luiz and Bueno, Pedro and Sousa, Thiago and Maia, Helena and da Silva, N{\'a}dia and Colombini, Esther and Pedrini, Helio and Avila, Sandra},
  booktitle = "Workshop on Multi-lingual Representation Learning (MRL), Conference on Empirical Methods in Natural Language Processing (EMNLP)",
  year = "2023"
}
```