# RobustVLM
[[Paper]](https://arxiv.org/abs/2402.12336) [[HuggingFace]](https://huggingface.co/collections/chs20/robust-clip-65d913e552eca001fdc41978) [[BibTeX]](#citation) 

This repository contains code for the paper "Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models" (_Oral@ICML 2024_).

<p align="center">
    <img src="assets/teaser0.png" width="500">
    <br>
</p>

******

<p align="center">
    <img src="assets/teaser1.png" width="800">
</p>


We fine-tune CLIP in an unsupervised manner to improve its robustness to visual adversarial attacks.
We show that replacing the vision encoder of large vision-language models with our fine-tuned CLIP models yields state-of-the-art 
adversarial robustness on a variety of vision-language tasks, without requiring any training of the large VLMs themselves.
Moreover, we improve the robustness of CLIP to adversarial attacks in zero-shot classification settings, while maintaining 
higher clean accuracy than previous adversarial fine-tuning methods.

## Table of Contents
- [Installation](#installation)
- [Models](#models)
    - [Loading pretrained models](#loading-pretrained-models)
    - [Summary of results](#summary-of-results)
- [Training](#training)
- [Evaluation](#evaluation)

## Installation
The code is tested with Python 3.11. To install the required packages, run:
```shell
pip install -r requirements.txt
```

## Models
We provide the following adversarially fine-tuned ViT-L/14 CLIP models (approx. 1.1 GB each):

| Model             | Link                                                                                             | Proposed by                                            | Notes                                                                                     |
|-------------------|--------------------------------------------------------------------------------------------------|--------------------------------------------------------|-------------------------------------------------------------------------------------------|
| TeCoA<sup>2</sup> | [Link](https://nc.mlcloud.uni-tuebingen.de/index.php/s/5SQzfAbp8JHS3o7/download/tecoa_eps_2.pt)  | [Mao et al. (2023)](https://arxiv.org/abs/2212.07016)  | Supervised adversarial fine-tuning with $\ell_\infty$ norm, $\varepsilon=\frac{2}{255}$   |
| TeCoA<sup>4</sup> | [Link](https://nc.mlcloud.uni-tuebingen.de/index.php/s/92req4Pak5i56tX/download/tecoa_eps_4.pt)  | [Mao et al. (2023)](https://arxiv.org/abs/2212.07016)  | Supervised adversarial fine-tuning with $\ell_\infty$ norm, $\varepsilon=\frac{4}{255}$   |
| FARE<sup>2</sup>  | [Link](https://nc.mlcloud.uni-tuebingen.de/index.php/s/d83Lqm8Jpowxp4m/download/fare_eps_2.pt)   | ours                                                   | Unsupervised adversarial fine-tuning with $\ell_\infty$ norm, $\varepsilon=\frac{2}{255}$ |
| FARE<sup>4</sup>  | [Link](https://nc.mlcloud.uni-tuebingen.de/index.php/s/jnQ2qmp9tst8kyQ/download/fare_eps_4.pt)   | ours                                                   | Unsupervised adversarial fine-tuning with $\ell_\infty$ norm, $\varepsilon=\frac{4}{255}$ |

The models are also available on [HuggingFace](https://huggingface.co/collections/chs20/robust-clip-65d913e552eca001fdc41978).

All models are adversarially fine-tuned for two epochs on ImageNet. TeCoA is trained in a supervised fashion, utilizing ImageNet class labels. FARE, in contrast, does not require any labels for training.

### Loading pretrained models
The provided checkpoints correspond to the vision encoder of CLIP. To load the full CLIP model (including the text encoder), you can use the following code:
```python
import torch
from open_clip import create_model_and_transforms
model, _, image_processor = create_model_and_transforms(
    'ViT-L-14', pretrained='openai', device='cpu'
)
checkpoint = torch.load('/path/to/fare_eps_2.pt', map_location=torch.device('cpu'))
model.visual.load_state_dict(checkpoint)
```
Alternatively, load the model directly from HuggingFace:
```python
from open_clip import create_model_and_transforms
model, _, image_processor = create_model_and_transforms('hf-hub:chs20/fare2-clip')
```
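In either case, the loaded model behaves like a standard OpenCLIP model. Below is a minimal zero-shot classification sketch that reuses `model` and `image_processor` from above; the image path and prompts are placeholders for illustration, not files or labels provided by this repository.
```python
import torch
from PIL import Image
from open_clip import get_tokenizer

tokenizer = get_tokenizer('ViT-L-14')

# Placeholder image and prompts, for illustration only.
image = image_processor(Image.open('/path/to/image.jpg')).unsqueeze(0)
prompts = ['a photo of a cat', 'a photo of a dog']
text = tokenizer(prompts)

model.eval()
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image and each prompt.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(prompts, probs.squeeze(0).tolist())))
```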

### Summary of results 
We show a summary of results on zero-shot classification and vision-language tasks for the original and fine-tuned ViT-L/14 CLIP models. *CLIP-only* means that we evaluate the respective CLIP model in a standalone fashion on zero-shot classification, whereas for the *OpenFlamingo* and *LLaVA* evaluations we use the respective CLIP model as the vision encoder of these large vision-language models. Results for individual zero-shot datasets and further VLM tasks are provided in the paper.

- Clean evaluation:
<table>
    <tr>
        <td></td>
        <td>CLIP-only</td>
        <td colspan="2">OpenFlamingo 9B</td>
        <td colspan="2">LLaVA 1.5 7B</td>
    </tr>
    <tr>
        <td>Model</td>
        <td>Avg. zero-shot</td>
        <td>COCO</td>
        <td>TextVQA</td>
        <td>COCO</td>
        <td>TextVQA</td>
    </tr>
    <tr>
        <td>OpenAI</td>
        <td>73.1</td>
        <td>79.7</td>
        <td>23.8</td>
        <td>115.5</td>
        <td>37.1</td>
    </tr>
    <tr>
        <td>TeCoA<sup>2</sup></td>
        <td>60.0</td>
        <td>73.5</td>
        <td>16.6</td>
        <td>98.4</td>
        <td>24.1</td>
    </tr>
    <tr>
        <td>FARE<sup>2</sup></td>
        <td>67.0</td>
        <td>79.1</td>
        <td>21.6</td>
        <td>109.9</td>
        <td>31.9</td>
    </tr>
    <tr>
        <td>TeCoA<sup>4</sup></td>
        <td>54.2</td>
        <td>66.9</td>
        <td>15.4</td>
        <td>88.3</td>
        <td>20.7</td>
    </tr>
    <tr>
        <td>FARE<sup>4</sup></td>
        <td>61.1</td>
        <td>74.1</td>
        <td>18.6</td>
        <td>102.4</td>
        <td>27.6</td>
    </tr>
</table>

- Adversarial evaluation ($\ell_\infty, ~ \varepsilon=\frac{2}{255}$):
<table>
    <tr>
        <td></td>
        <td>CLIP-only</td>
        <td colspan="2">OpenFlamingo 9B</td>
        <td colspan="2">LLaVA 1.5 7B</td>
    </tr>
    <tr>
        <td>Model</td>
        <td>Avg. zero-shot</td>
        <td>COCO</td>
        <td>TextVQA</td>
        <td>COCO</td>
        <td>TextVQA</td>
    </tr>
    <tr>
        <td>OpenAI</td>
        <td>0.0</td>
        <td>1.5</td>
        <td>0.0</td>
        <td>4.0</td>
        <td>0.5</td>
    </tr>
    <tr>
        <td>TeCoA<sup>2</sup></td>
        <td>43.6</td>
        <td>31.6</td>
        <td>3.5</td>
        <td>44.2</td>
        <td>12.1</td>
    </tr>
    <tr>
        <td>FARE<sup>2</sup></td>
        <td>43.1</td>
        <td>34.2</td>
        <td>4.1</td>
        <td>53.6</td>
        <td>14.7</td>
    </tr>
    <tr>
        <td>TeCoA<sup>4</sup></td>
        <td>42.3</td>
        <td>28.5</td>
        <td>2.1</td>
        <td>50.9</td>
        <td>12.6</td>
    </tr>
    <tr>
        <td>FARE<sup>4</sup></td>
        <td>45.9</td>
        <td>30.9</td>
        <td>3.4</td>
        <td>57.1</td>
        <td>15.8</td>
    </tr>
</table>

- Adversarial evaluation ($\ell_\infty, ~ \varepsilon=\frac{4}{255}$):
<table>
    <tr>
        <td></td>
        <td>CLIP-only</td>
        <td colspan="2">OpenFlamingo 9B</td>
        <td colspan="2">LLaVA 1.5 7B</td>
    </tr>
    <tr>
        <td>Model</td>
        <td>Avg. zero-shot</td>
        <td>COCO</td>
        <td>TextVQA</td>
        <td>COCO</td>
        <td>TextVQA</td>
    </tr>
    <tr>
        <td>OpenAI</td>
        <td>0.0</td>
        <td>1.1</td>
        <td>0.0</td>
        <td>3.1</td>
        <td>0.0</td>
    </tr>
    <tr>
        <td>TeCoA<sup>2</sup></td>
        <td>27.0</td>
        <td>21.2</td>
        <td>2.1</td>
        <td>30.3</td>
        <td>8.8</td>
    </tr>
    <tr>
        <td>FARE<sup>2</sup></td>
        <td>20.5</td>
        <td>19.5</td>
        <td>1.9</td>
        <td>31.0</td>
        <td>9.1</td>
    </tr>
    <tr>
        <td>TeCoA<sup>4</sup></td>
        <td>31.9</td>
        <td>21.6</td>
        <td>1.8</td>
        <td>35.3</td>
        <td>9.3</td>
    </tr>
    <tr>
        <td>FARE<sup>4</sup></td>
        <td>32.4</td>
        <td>22.8</td>
        <td>2.9</td>
        <td>40.9</td>
        <td>10.9</td>
    </tr>
</table>

## Training

- TeCoA<sup>4</sup>
```shell
python -m train.adversarial_training_clip --clip_model_name ViT-L-14 --pretrained openai --dataset imagenet --imagenet_root /path/to/imagenet --template std --output_normalize True --steps 20000 --warmup 1400 --batch_size 128 --loss ce --opt adamw --lr 1e-5 --wd 1e-4 --attack pgd --inner_loss ce --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1 --wandb False --output_dir /path/to/out/dir --experiment_name TECOA4 --log_freq 10 --eval_freq 10
```

- FARE<sup>4</sup>
```shell
python -m train.adversarial_training_clip --clip_model_name ViT-L-14 --pretrained openai --dataset imagenet --imagenet_root /path/to/imagenet --template std --output_normalize False --steps 20000 --warmup 1400 --batch_size 128 --loss l2 --opt adamw --lr 1e-5 --wd 1e-4 --attack pgd --inner_loss l2 --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1 --wandb False --output_dir /path/to/out/dir --experiment_name FARE4 --log_freq 10 --eval_freq 10
```
Set `--eps 2` to obtain TeCoA<sup>2</sup> and FARE<sup>2</sup> models.
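
For intuition, the unsupervised FARE objective selected above by `--loss l2` and `--inner_loss l2` can be sketched as follows: an inner PGD attack maximizes the embedding distance between a perturbed image and its embedding under the original (frozen) CLIP encoder, and the outer step updates the trainable encoder to minimize that same distance. This is only an illustrative sketch under those assumptions, not the repository's training code; `vision_encoder`, `frozen_encoder`, and `fare_loss_sketch` are hypothetical names.
```python
import torch
import torch.nn.functional as F

def fare_loss_sketch(vision_encoder, frozen_encoder, images,
                     eps=4/255, alpha=1/255, steps=10):
    """Illustrative FARE step: push the embeddings of adversarial images
    (under the trainable encoder) towards the frozen original CLIP embeddings."""
    with torch.no_grad():
        clean_emb = frozen_encoder(images)  # fixed targets, never updated

    # Inner maximization: PGD in an l_inf ball of radius eps around the images
    # (image-range clipping omitted for brevity).
    delta = torch.zeros_like(images).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        adv_emb = vision_encoder(images + delta)
        inner_loss = F.mse_loss(adv_emb, clean_emb)      # l2 embedding distance
        grad, = torch.autograd.grad(inner_loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)

    # Outer minimization: the same distance, now w.r.t. the encoder weights.
    adv_emb = vision_encoder((images + delta).detach())
    return F.mse_loss(adv_emb, clean_emb)
```
The supervised TeCoA variant differs mainly in the loss: it replaces the embedding distance with a cross-entropy loss on ImageNet labels (`--loss ce`, `--inner_loss ce`).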

## Evaluation
Make sure the files in the `bash` directory are executable: `chmod +x bash/*`
### CLIP ImageNet
```shell
python -m CLIP_eval.clip_robustbench --clip_model_name ViT-L-14 --pretrained /path/to/ckpt.pt --dataset imagenet --imagenet_root /path/to/imagenet --wandb False --norm linf --eps 2
```

### CLIP Zero-Shot
Set models to be evaluated in `CLIP_benchmark/benchmark/models.txt` and datasets in `CLIP_benchmark/benchmark/datasets.txt`
(the datasets are downloaded from HuggingFace). Then run
```shell
cd CLIP_benchmark
./bash/run_benchmark_adv.sh
```

### VLM Captioning and VQA
#### LLaVA
In `bash/llava_eval.sh`, supply the paths to the datasets. The required annotation files can be obtained from this [HuggingFace repository](https://huggingface.co/datasets/openflamingo/eval_benchmark/tree/main).
Set `--vision_encoder_pretrained` to `openai` or supply the path to a fine-tuned CLIP model checkpoint.
Then run
```shell
./bash/llava_eval.sh
```
The LLaVA model will be automatically downloaded from HuggingFace.

#### OpenFlamingo
Download the OpenFlamingo 9B [model](https://huggingface.co/openflamingo/OpenFlamingo-9B-vitl-mpt7b/tree/main), supply the paths in `bash/of_eval_9B.sh`, and run
```shell
./bash/of_eval_9B.sh
```

Some non-standard annotation files are supplied [here](https://nc.mlcloud.uni-tuebingen.de/index.php/s/mtRnQFaZJkR9zaX) and [here](https://github.com/mlfoundations/open_flamingo/tree/main/open_flamingo/eval/data).

### VLM Stealthy Targeted Attacks
For targeted attacks on COCO, run
```shell
./bash/llava_eval_targeted.sh
```
For targeted attacks on self-selected images, set the images and target captions in `vlm_eval/run_evaluation_qualitative.py` and run
```shell
python -m vlm_eval.run_evaluation_qualitative --precision float32 --attack apgd --eps 2 --steps 10000 --vlm_model_name llava --vision_encoder_pretrained openai --verbose
```
With 10,000 iterations it takes about 2 hours per image on an A100 GPU.

### POPE
```shell
./bash/eval_pope.sh openai  # clean model evaluation
./bash/eval_pope.sh  # robust model evaluation - set path_to_ckpt in the bash file
```
### SQA
```shell
./bash/eval_scienceqa.sh openai  # clean model evaluation
./bash/eval_scienceqa.sh  # robust model evaluation - set path_to_ckpt in the bash file
```

## Acknowledgements
This repository gratefully builds on code from
- [OpenFlamingo](https://github.com/mlfoundations/open_flamingo)
- [LLaVA](https://github.com/haotian-liu/LLaVA)
- [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark)
- [AutoAttack](https://github.com/fra31/auto-attack)

## Citation
If you find this repository useful, please consider citing our paper:
```bibtex
@inproceedings{schlarmann2024robustclip,
    title={Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models},
    author={Christian Schlarmann and Naman Deep Singh and Francesco Croce and Matthias Hein},
    year={2024},
    booktitle={ICML}
}
```