---
license: mit
language:
- pt
tags:
- CAPIVARA
- Portuguese CLIP
- Portuguese
- OpenCLIP
datasets:
- conceptual_captions
- PraCegoVer
- MS_COCO
- Flickr30K
- ImageNet
- ELEVATER
---
# Model Card for CAPIVARA

CAPIVARA is a cost-efficient framework designed to enhance the performance of multilingual CLIP models in low-resource languages.
This model achieves state-of-the-art performance on many zero-shot tasks involving images and Portuguese text.

## Model Details

### Model Description

CAPIVARA is built upon the pre-trained [OpenCLIP ViT-B/32 XLM-Roberta Base](https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k) and
fine-tuned with [Conceptual Captions](https://aclanthology.org/P18-1238.pdf) and synthetic captions generated by [BLIP2](https://huggingface.co/Salesforce/blip2-opt-2.7b-coco).
All captions were translated into Portuguese with Google Translator.
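
As a rough illustration of the synthetic-captioning step only (not the exact CAPIVARA pipeline, which also translates the generated captions with Google Translator), the sketch below captions an image with BLIP2 through the Hugging Face Transformers API. The image path is a placeholder.

```python
# Minimal BLIP2 captioning sketch; "example.jpg" is a placeholder path and the
# generation settings are illustrative, not the ones used to build the training set.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b-coco")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b-coco",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to(device, model.dtype)
generated_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)  # English caption; in CAPIVARA these captions are then translated into Portuguese
```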

## Uses

### Direct Use

Zero-shot image classification, zero-shot image and text retrieval, etc.
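
As a usage illustration, the sketch below runs zero-shot classification with the OpenCLIP architecture the model is based on. The checkpoint file name `capivara.pt`, the image path, and the Portuguese prompts are placeholders; loading the fine-tuned weights released in this repository may differ (see the evaluation notebooks linked in the Evaluation section for the exact code).

```python
# Zero-shot classification sketch on the OpenCLIP ViT-B/32 XLM-Roberta Base architecture.
# The CAPIVARA checkpoint loading below is commented out and hypothetical.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "xlm-roberta-base-ViT-B-32", pretrained="laion5b_s13b_b90k"
)
tokenizer = open_clip.get_tokenizer("xlm-roberta-base-ViT-B-32")

# Hypothetical: overwrite the baseline weights with the CAPIVARA fine-tuned checkpoint.
# state_dict = torch.load("capivara.pt", map_location="cpu")
# model.load_state_dict(state_dict)
model.eval()

labels = ["um gato", "um cachorro", "uma capivara"]  # Portuguese class prompts
image = preprocess(Image.open("example.jpg")).unsqueeze(0)
text = tokenizer(labels)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs.squeeze(0).tolist())))
```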

### Downstream Use

Image classification and other image-task fine-tuning, linear-probe image classification,
image captioning, image generation guiding and conditioning, etc.


## Ethical Considerations

For ethical considerations, please read the Model Cards section of the [paper](https://arxiv.org/abs/2310.13683).

## Training Details

### Training Data

The model was fine-tuned with [Conceptual Captions](https://aclanthology.org/P18-1238.pdf) and synthetic captions generated by [BLIP2](https://huggingface.co/Salesforce/blip2-opt-2.7b-coco).
All captions were translated into Portuguese with Google Translator.

#### Training Hyperparameters

```
Optimizer: "Adam"
eps: 1e-8
weight_decay: 0.2
betas: [ 0.9, 0.98 ]

LR_scheduler: "CosineWarmupLR"
min_learning_rate: 1e-7
max_learning_rate: 5e-7
warmup_lr: 500

batch_size: 2816
max_steps: 5863 # 10 epochs
```
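
For concreteness, the sketch below shows one way these values map onto a PyTorch optimizer and learning-rate schedule, assuming `warmup_lr: 500` means 500 warmup steps and that `CosineWarmupLR` means linear warmup followed by cosine decay. The actual scheduler implementation in the CAPIVARA codebase may differ.

```python
# Rough PyTorch sketch of the configuration above (not the CAPIVARA training code).
import math
import torch

MAX_LR, MIN_LR = 5e-7, 1e-7
WARMUP_STEPS, MAX_STEPS = 500, 5863

model = torch.nn.Linear(512, 512)  # placeholder for the CLIP image/text towers

optimizer = torch.optim.Adam(
    model.parameters(), lr=MAX_LR, betas=(0.9, 0.98), eps=1e-8, weight_decay=0.2
)

def lr_lambda(step: int) -> float:
    # Linear warmup to MAX_LR, then cosine decay towards MIN_LR.
    if step < WARMUP_STEPS:
        return (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, MAX_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return (MIN_LR + (MAX_LR - MIN_LR) * cosine) / MAX_LR

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(MAX_STEPS):
    # ... forward pass, contrastive loss, loss.backward() elided ...
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```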

## Evaluation

+ [Zero-shot image classification](https://github.com/hiaac-nlp/CAPIVARA/blob/main/clip_pt/src/evaluate/capivara_classification.ipynb)
+ [Zero-shot cross-modal retrieval](https://github.com/hiaac-nlp/CAPIVARA/blob/main/clip_pt/src/evaluate/capivara_retrieval.ipynb)

### Testing Data, Factors & Metrics

#### Testing Data

For cross-modal retrieval, we used [PraCegoVer](https://www.mdpi.com/2306-5729/7/2/13), which is composed of images originally annotated
with Portuguese texts, and our Portuguese-translated versions of [MS COCO](https://link.springer.com/chapter/10.1007/978-3-319-10602-1_48)
and [Flickr30k](https://openaccess.thecvf.com/content_iccv_2015/papers/Plummer_Flickr30k_Entities_Collecting_ICCV_2015_paper.pdf).
We also translated the labels from [ImageNet](https://ieeexplore.ieee.org/document/5206848) and the
[ELEVATER](https://proceedings.neurips.cc/paper_files/paper/2022/hash/3c4688b6a76f25f2311daa0d75a58f1a-Abstract-Datasets_and_Benchmarks.html)
benchmark datasets for image classification.

### Results

#### Zero-shot Cross-Modal Retrieval

We conducted zero-shot cross-modal retrieval experiments on Flickr30k and MS COCO, with captions
translated into Portuguese, and on PraCegoVer. We report the average and standard deviation over 3 runs.

<table>
<thead>
  <tr>
    <th>Models</th>
    <th colspan="2">Flickr30k</th>
    <th colspan="2">MS COCO</th>
    <th colspan="2">PraCegoVer</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td></td>
    <td>text-to-image</td>
    <td>image-to-text</td>
    <td>text-to-image</td>
    <td>image-to-text</td>
    <td>text-to-image</td>
    <td>image-to-text</td>
  </tr>
  <tr>
    <td>OpenCLIP ViT-B/32 XLM-Roberta Base (Baseline)</td>
    <td>76.23</td>
    <td>87.93</td>
    <td>52.62</td>
    <td>66.55</td>
    <td>65.36</td>
    <td><b>69.43</b></td>
  </tr>
  <tr>
    <td>CAPIVARA</td>
    <td><b>79.56 ± 0.01</b></td>
    <td><b>89.95 ± 0.04</b></td>
    <td><b>56.27 ± 0.01</b></td>
    <td><b>71.24 ± 0.01</b></td>
    <td><b>66.40 ± 0.01</b></td>
    <td>64.75 ± 0.01</td>
  </tr>
</tbody>
</table>
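
For reference, the scores above are produced by the retrieval notebook linked in the Evaluation section. Below is a minimal, self-contained sketch of a recall@k-style text-to-image score computed from precomputed embeddings; the tensor shapes, the value of k, and the one-caption-per-image pairing are illustrative only, and the exact metric definition used in the paper may differ.

```python
# Illustrative recall@k for text-to-image retrieval from L2-normalized embeddings.
import torch

def recall_at_k(text_emb: torch.Tensor, image_emb: torch.Tensor, k: int = 10) -> float:
    # text_emb: (N, D), image_emb: (N, D); caption i is assumed to match image i.
    sims = text_emb @ image_emb.T                  # (N, N) cosine similarities
    topk = sims.topk(k, dim=-1).indices            # top-k image indices per caption
    targets = torch.arange(text_emb.size(0)).unsqueeze(1)
    hits = (topk == targets).any(dim=-1).float()   # 1 if the true image is in the top k
    return hits.mean().item()

# Toy usage with random unit vectors.
text = torch.nn.functional.normalize(torch.randn(100, 512), dim=-1)
image = torch.nn.functional.normalize(torch.randn(100, 512), dim=-1)
print(f"recall@10: {recall_at_k(text, image, k=10):.3f}")
```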

#### Zero-shot Image Classification

| Models | **Caltech-101** | **CIFAR-10** | **CIFAR-100** | **Country-211** | **DTD** | **EuroSAT** | **FER-2013** | **FGVC-Aircraft** | **Food-101** | **GTSRB** | **Hateful-Memes** | **KITTI-Distance** | **MNIST** | **Oxford Flowers-102** | **Oxford-IIIT Pets** | **PatchCamelyon** | **Rendered-SST2** | **RESISC-45** | **Stanford-Cars** | **PASCAL VOC-2007** | **Average** | **ImageNet-1k** |
|:-----------------------:|:---------------:|:------------:|:-------------:|:---------------:|:------------:|:------------:|:------------:|:-----------------:|:------------:|:------------:|:-----------------:|:------------------:|:------------:|:----------------------:|:--------------------:|:-----------------:|:-----------------:|:-------------:|:-----------------:|:-------------------:|:------------:|:---------------:|
| OpenCLIP ViT-B/32 XLM-Roberta Base (Baseline) | 84.53 ± 0.00 | 93.99 ± 0.00 | 68.44 ± 0.00 | 17.82 ± 0.00 | 41.17 ± 0.00 | 47.16 ± 0.00 | 48.65 ± 0.00 | 26.30 ± 0.00 | 65.06 ± 0.00 | 43.27 ± 0.00 | 56.50 ± 0.00 | 28.41 ± 0.00 | 54.99 ± 0.00 | 50.88 ± 0.00 | 81.56 ± 0.00 | 50.96 ± 0.00 | 54.20 ± 0.00 | 58.51 ± 0.00 | 84.93 ± 0.00 | 82.09 ± 0.00 | 56.97 ± 0.00 | 45.84 ± 0.00 |
| CAPIVARA | 82.97 ± 0.03 | 93.85 ± 0.00 | 69.37 ± 0.01 | 17.61 ± 0.00 | 42.34 ± 0.04 | 47.77 ± 0.02 | 46.68 ± 0.05 | 25.49 ± 0.01 | 64.58 ± 0.01 | 46.34 ± 0.01 | 56.17 ± 0.00 | 33.94 ± 0.13 | 60.14 ± 0.04 | 49.93 ± 0.02 | 79.37 ± 0.00 | 51.71 ± 0.01 | 54.82 ± 0.03 | 59.71 ± 0.01 | 85.10 ± 0.02 | 82.29 ± 0.00 | **57.51 ± 0.02** | **46.06 ± 0.01** |

## Environmental Impact

- **GPU:** 1 x Quadro RTX 8000 (48 GB)
- **Hours used:** 31 hours
- **Compute Region:** Brazil
- **Carbon footprint:** 0.5 kg
- **Energy:** 6.49 kWh

## Citation

```bibtex
@inproceedings{santos2023capivara,
  title     = {CAPIVARA: Cost-Efficient Approach for Improving Multilingual CLIP Performance on Low-Resource Languages},
  author    = {Santos, Gabriel O. dos and Moreira, Diego A. B. and Ferreira, Alef I. and Silva, Jhessica and Pereira, Luiz and Bueno, Pedro and Sousa, Thiago and Maia, Helena and da Silva, N{\'a}dia and Colombini, Esther and Pedrini, Helio and Avila, Sandra},
  booktitle = {Workshop on Multi-lingual Representation Learning (MRL), Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2023}
}
```