# RobustVLM
[[Paper]](https://arxiv.org/abs/2402.12336) [[HuggingFace]](https://huggingface.co/collections/chs20/robust-clip-65d913e552eca001fdc41978) [[BibTeX]](#citation)
This repository contains code for the paper "Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models" (_Oral@ICML 2024_).
<p align="center">
<img src="assets/teaser0.png" width="500">
<br>
</p>
******
<p align="center">
<img src="assets/teaser1.png" width="800">
</p>
We fine-tune CLIP in an unsupervised manner to improve its robustness to visual adversarial attacks.
We show that replacing the vision encoder of large vision-language models with our fine-tuned CLIP models yields state-of-the-art
adversarial robustness on a variety of vision-language tasks, without requiring any training of the large VLMs themselves.
Moreover, we improve the robustness of CLIP to adversarial attacks in zero-shot classification settings, while maintaining
higher clean accuracy than previous adversarial fine-tuning methods.
## Table of Contents
- [Installation](#installation)
- [Models](#models)
- [Loading pretrained models](#loading-pretrained-models)
- [Summary of results](#summary-of-results)
- [Training](#training)
- [Evaluation](#evaluation)
## Installation
The code is tested with Python 3.11. To install the required packages, run:
```shell
pip install -r requirements.txt
```
## Models
We provide the following adversarially fine-tuned ViT-L/14 CLIP models (approx. 1.1 GB each):
| Model | Link | Proposed by | Notes |
|-------------------|--------------------------------------------------------------------------------------------------|--------------------------------------------------------|-------------------------------------------------------------------------------------------|
| TeCoA<sup>2</sup> | [Link](https://nc.mlcloud.uni-tuebingen.de/index.php/s/5SQzfAbp8JHS3o7/download/tecoa_eps_2.pt) | [Mao et al. (2023)](https://arxiv.org/abs/2212.07016) | Supervised adversarial fine-tuning with $\ell_\infty$ norm, $\varepsilon=\frac{2}{255}$ |
| TeCoA<sup>4</sup> | [Link](https://nc.mlcloud.uni-tuebingen.de/index.php/s/92req4Pak5i56tX/download/tecoa_eps_4.pt) | [Mao et al. (2023)](https://arxiv.org/abs/2212.07016) | Supervised adversarial fine-tuning with $\ell_\infty$ norm, $\varepsilon=\frac{4}{255}$ |
| FARE<sup>2</sup> | [Link](https://nc.mlcloud.uni-tuebingen.de/index.php/s/d83Lqm8Jpowxp4m/download/fare_eps_2.pt) | ours | Unsupervised adversarial fine-tuning with $\ell_\infty$ norm, $\varepsilon=\frac{2}{255}$ |
| FARE<sup>4</sup> | [Link](https://nc.mlcloud.uni-tuebingen.de/index.php/s/jnQ2qmp9tst8kyQ/download/fare_eps_4.pt) | ours | Unsupervised adversarial fine-tuning with $\ell_\infty$ norm, $\varepsilon=\frac{4}{255}$ |
The models are also available on [HuggingFace](https://huggingface.co/collections/chs20/robust-clip-65d913e552eca001fdc41978).
All models are adversarially fine-tuned for two epochs on ImageNet. TeCoA is trained in a supervised fashion, utilizing ImageNet class labels. FARE, in contrast, does not require any labels for training.
### Loading pretrained models
The provided checkpoints correspond to the vision encoder of CLIP. To load the full CLIP model (including the text encoder), you can use the following code:
```python
import torch
from open_clip import create_model_and_transforms
# load the original OpenAI CLIP ViT-L/14 together with its preprocessing
model, _, image_processor = create_model_and_transforms(
    'ViT-L-14', pretrained='openai', device='cpu'
)
# replace the vision encoder weights with a fine-tuned checkpoint
checkpoint = torch.load('/path/to/fare_eps_2.pt', map_location=torch.device('cpu'))
model.visual.load_state_dict(checkpoint)
```
Alternatively, load the full model directly from HuggingFace:
```python
from open_clip import create_model_and_transforms
model, _, image_processor = create_model_and_transforms('hf-hub:chs20/fare2-clip')
```
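The loaded model behaves like any other `open_clip` model. Below is a minimal zero-shot classification sketch using the `model` and `image_processor` loaded above; the image path and class prompts are placeholders:
```python
import torch
from PIL import Image
from open_clip import get_tokenizer

tokenizer = get_tokenizer('ViT-L-14')

# placeholder inputs; replace with your own image and class prompts
image = image_processor(Image.open('/path/to/image.jpg')).unsqueeze(0)
text = tokenizer(['a photo of a dog', 'a photo of a cat'])

with torch.no_grad():
    # embed the image and the class prompts, then compare via cosine similarity
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
```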
### Summary of results
We show a summary of results on zero-shot classification and vision-language tasks for the original and the fine-tuned ViT-L/14 CLIP models. *CLIP-only* means that the respective CLIP model is evaluated in a standalone fashion for zero-shot classification, whereas for *OpenFlamingo* and *LLaVA* the respective CLIP model is used as the vision encoder of these large vision-language models. Results for individual zero-shot datasets and further VLM tasks
are provided in the paper.
- Clean evaluation:
<table>
<tr>
<td></td>
<td>CLIP-only</td>
<td colspan="2">OpenFlamingo 9B</td>
<td colspan="2">LLaVA 1.5 7B</td>
</tr>
<tr>
<td>Model</td>
<td>Avg. zero-shot</td>
<td>COCO</td>
<td>TextVQA</td>
<td>COCO</td>
<td>TextVQA</td>
</tr>
<tr>
<td>OpenAI</td>
<td>73.1</td>
<td>79.7</td>
<td>23.8</td>
<td>115.5</td>
<td>37.1</td>
</tr>
<tr>
<td>TeCoA<sup>2</sup></td>
<td>60.0</td>
<td>73.5</td>
<td>16.6</td>
<td>98.4</td>
<td>24.1</td>
</tr>
<tr>
<td>FARE<sup>2</sup></td>
<td>67.0</td>
<td>79.1</td>
<td>21.6</td>
<td>109.9</td>
<td>31.9</td>
</tr>
<tr>
<td>TeCoA<sup>4</sup></td>
<td>54.2</td>
<td>66.9</td>
<td>15.4</td>
<td>88.3</td>
<td>20.7</td>
</tr>
<tr>
<td>FARE<sup>4</sup></td>
<td>61.1</td>
<td>74.1</td>
<td>18.6</td>
<td>102.4</td>
<td>27.6</td>
</tr>
</table>
- Adversarial evaluation ($\ell_\infty, ~ \varepsilon=\frac{2}{255}$):
<table>
<tr>
<td></td>
<td>CLIP-only</td>
<td colspan="2">OpenFlamingo 9B</td>
<td colspan="2">LLaVA 1.5 7B</td>
</tr>
<tr>
<td>Model</td>
<td>Avg. zero-shot</td>
<td>COCO</td>
<td>TextVQA</td>
<td>COCO</td>
<td>TextVQA</td>
</tr>
<tr>
<td>OpenAI</td>
<td>0.0</td>
<td>1.5</td>
<td>0.0</td>
<td>4.0</td>
<td>0.5</td>
</tr>
<tr>
<td>TeCoA<sup>2</sup></td>
<td>43.6</td>
<td>31.6</td>
<td>3.5</td>
<td>44.2</td>
<td>12.1</td>
</tr>
<tr>
<td>FARE<sup>2</sup></td>
<td>43.1</td>
<td>34.2</td>
<td>4.1</td>
<td>53.6</td>
<td>14.7</td>
</tr>
<tr>
<td>TeCoA<sup>4</sup></td>
<td>42.3</td>
<td>28.5</td>
<td>2.1</td>
<td>50.9</td>
<td>12.6</td>
</tr>
<tr>
<td>FARE<sup>4</sup></td>
<td>45.9</td>
<td>30.9</td>
<td>3.4</td>
<td>57.1</td>
<td>15.8</td>
</tr>
</table>
- Adversarial evaluation ($\ell_\infty, ~ \varepsilon=\frac{4}{255}$):
<table>
<tr>
<td></td>
<td>CLIP-only</td>
<td colspan="2">OpenFlamingo 9B</td>
<td colspan="2">LLaVA 1.5 7B</td>
</tr>
<tr>
<td>Model</td>
<td>Avg. zero-shot</td>
<td>COCO</td>
<td>TextVQA</td>
<td>COCO</td>
<td>TextVQA</td>
</tr>
<tr>
<td>OpenAI</td>
<td>0.0</td>
<td>1.1</td>
<td>0.0</td>
<td>3.1</td>
<td>0.0</td>
</tr>
<tr>
<td>TeCoA<sup>2</sup></td>
<td>27.0</td>
<td>21.2</td>
<td>2.1</td>
<td>30.3</td>
<td>8.8</td>
</tr>
<tr>
<td>FARE<sup>2</sup></td>
<td>20.5</td>
<td>19.5</td>
<td>1.9</td>
<td>31.0</td>
<td>9.1</td>
</tr>
<tr>
<td>TeCoA<sup>4</sup></td>
<td>31.9</td>
<td>21.6</td>
<td>1.8</td>
<td>35.3</td>
<td>9.3</td>
</tr>
<tr>
<td>FARE<sup>4</sup></td>
<td>32.4</td>
<td>22.8</td>
<td>2.9</td>
<td>40.9</td>
<td>10.9</td>
</tr>
</table>
## Training
- TeCoA<sup>4</sup>
```shell
python -m train.adversarial_training_clip --clip_model_name ViT-L-14 --pretrained openai --dataset imagenet --imagenet_root /path/to/imagenet --template std --output_normalize True --steps 20000 --warmup 1400 --batch_size 128 --loss ce --opt adamw --lr 1e-5 --wd 1e-4 --attack pgd --inner_loss ce --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1 --wandb False --output_dir /path/to/out/dir --experiment_name TECOA4 --log_freq 10 --eval_freq 10
```
- FARE<sup>4</sup>
```shell
python -m train.adversarial_training_clip --clip_model_name ViT-L-14 --pretrained openai --dataset imagenet --imagenet_root /path/to/imagenet --template std --output_normalize False --steps 20000 --warmup 1400 --batch_size 128 --loss l2 --opt adamw --lr 1e-5 --wd 1e-4 --attack pgd --inner_loss l2 --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1 --wandb False --output_dir /path/to/out/dir --experiment_name FARE4 --log_freq 10 --eval_freq 10
```
Set `--eps 2` to obtain TeCoA<sup>2</sup> and FARE<sup>2</sup> models.
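For reference, the two losses differ as follows: TeCoA uses a supervised cross-entropy loss on ImageNet labels (`--loss ce`), while FARE uses an unsupervised embedding loss (`--loss l2`) that keeps the embeddings of adversarial images close, in $\ell_2$ distance, to the embeddings that the original, frozen CLIP encoder assigns to the clean images. Below is a minimal PyTorch-style sketch of the FARE objective with illustrative names only; see `train/adversarial_training_clip.py` for the actual implementation:
```python
import torch

def fare_loss(vision_encoder, frozen_encoder, images,
              eps=4/255, step_size=1/255, pgd_steps=10):
    """Sketch of the unsupervised FARE objective (illustrative, not the actual training code)."""
    with torch.no_grad():
        clean_emb = frozen_encoder(images)  # embeddings of the original, frozen CLIP encoder

    # inner maximization: PGD within the l_inf ball, maximizing the l2 embedding distance
    delta = torch.zeros_like(images, requires_grad=True)
    for _ in range(pgd_steps):
        adv_emb = vision_encoder(images + delta)
        inner_loss = ((adv_emb - clean_emb) ** 2).sum(dim=-1).mean()
        grad, = torch.autograd.grad(inner_loss, delta)
        with torch.no_grad():
            delta += step_size * grad.sign()
            delta.clamp_(-eps, eps)  # clamping to the valid image range is omitted for brevity

    # outer minimization: keep adversarial embeddings close to the clean, frozen embeddings
    adv_emb = vision_encoder(images + delta.detach())
    return ((adv_emb - clean_emb) ** 2).sum(dim=-1).mean()
```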
## Evaluation
Make sure the files in the `bash` directory are executable: `chmod +x bash/*`
### CLIP ImageNet
```shell
python -m CLIP_eval.clip_robustbench --clip_model_name ViT-L-14 --pretrained /path/to/ckpt.pt --dataset imagenet --imagenet_root /path/to/imagenet --wandb False --norm linf --eps 2
```
### CLIP Zero-Shot
Set models to be evaluated in `CLIP_benchmark/benchmark/models.txt` and datasets in `CLIP_benchmark/benchmark/datasets.txt`
(the datasets are downloaded from HuggingFace). Then run
```shell
cd CLIP_benchmark
./bash/run_benchmark_adv.sh
```
### VLM Captioning and VQA
#### LLaVA
In `bash/llava_eval.sh`, supply the paths to the datasets. The required annotation files can be obtained from this [HuggingFace repository](https://huggingface.co/datasets/openflamingo/eval_benchmark/tree/main).
Set `--vision_encoder_pretrained` to `openai` or supply the path to a fine-tuned CLIP model checkpoint.
Then run
```shell
./bash/llava_eval.sh
```
The LLaVA model will be automatically downloaded from HuggingFace.
#### OpenFlamingo
Download the OpenFlamingo 9B [model](https://huggingface.co/openflamingo/OpenFlamingo-9B-vitl-mpt7b/tree/main), supply the paths in `bash/of_eval_9B.sh`, and run
```shell
./bash/of_eval_9B.sh
```
Some non-standard annotation files are supplied [here](https://nc.mlcloud.uni-tuebingen.de/index.php/s/mtRnQFaZJkR9zaX) and [here](https://github.com/mlfoundations/open_flamingo/tree/main/open_flamingo/eval/data).
### VLM Stealthy Targeted Attacks
For targeted attacks on COCO, run
```shell
./bash/llava_eval_targeted.sh
```
For targeted attacks on self-selected images, set images and target captions in `vlm_eval/run_evaluation_qualitative.py` and run
```shell
python -m vlm_eval.run_evaluation_qualitative --precision float32 --attack apgd --eps 2 --steps 10000 --vlm_model_name llava --vision_encoder_pretrained openai --verbose
```
With 10,000 iterations it takes about 2 hours per image on an A100 GPU.
### POPE
```shell
./bash/eval_pope.sh openai  # clean model evaluation
./bash/eval_pope.sh  # robust model evaluation - set the checkpoint path in the bash script
```
### SQA
```shell
./bash/eval_scienceqa.sh openai  # clean model evaluation
./bash/eval_scienceqa.sh  # robust model evaluation - set the checkpoint path in the bash script
```
## Acknowledgements
This repository gratefully builds on code from
- [OpenFlamingo](https://github.com/mlfoundations/open_flamingo)
- [LLaVA](https://github.com/haotian-liu/LLaVA)
- [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark)
- [AutoAttack](https://github.com/fra31/auto-attack)
## Citation
If you find this repository useful, please consider citing our paper:
```bibtex
@inproceedings{schlarmann2024robustclip,
    title={Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models},
    author={Christian Schlarmann and Naman Deep Singh and Francesco Croce and Matthias Hein},
    year={2024},
    booktitle={ICML}
}
```