# RobustVLM
[[Paper]](https://arxiv.org/abs/2402.12336) [[HuggingFace]](https://huggingface.co/collections/chs20/robust-clip-65d913e552eca001fdc41978) [[BibTeX]](#citation)
This repository contains code for the paper "Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models" (_Oral@ICML 2024_).
******
We fine-tune CLIP in an unsupervised manner to improve its robustness to visual adversarial attacks.
We show that replacing the vision encoder of large vision-language models with our fine-tuned CLIP models yields state-of-the-art
adversarial robustness on a variety of vision-language tasks, without requiring any training of the large VLMs themselves.
Moreover, we improve the robustness of CLIP to adversarial attacks in zero-shot classification settings, while maintaining
higher clean accuracy than previous adversarial fine-tuning methods.
## Table of Contents
- [Installation](#installation)
- [Models](#models)
- [Loading pretrained models](#loading-pretrained-models)
- [Summary of results](#summary-of-results)
- [Training](#training)
- [Evaluation](#evaluation)
## Installation
The code is tested with Python 3.11. To install the required packages, run:
```shell
pip install -r requirements.txt
```
## Models
We provide the following adversarially fine-tuned ViT-L/14 CLIP models (approx. 1.1 GB each):
| Model | Link | Proposed by | Notes |
|-------------------|--------------------------------------------------------------------------------------------------|--------------------------------------------------------|-------------------------------------------------------------------------------------------|
| TeCoA2 | [Link](https://nc.mlcloud.uni-tuebingen.de/index.php/s/5SQzfAbp8JHS3o7/download/tecoa_eps_2.pt) | [Mao et al. (2023)](https://arxiv.org/abs/2212.07016) | Supervised adversarial fine-tuning with $\ell_\infty$ norm, $\varepsilon=\frac{2}{255}$ |
| TeCoA4 | [Link](https://nc.mlcloud.uni-tuebingen.de/index.php/s/92req4Pak5i56tX/download/tecoa_eps_4.pt) | [Mao et al. (2023)](https://arxiv.org/abs/2212.07016) | Supervised adversarial fine-tuning with $\ell_\infty$ norm, $\varepsilon=\frac{4}{255}$ |
| FARE2 | [Link](https://nc.mlcloud.uni-tuebingen.de/index.php/s/d83Lqm8Jpowxp4m/download/fare_eps_2.pt) | ours | Unsupervised adversarial fine-tuning with $\ell_\infty$ norm, $\varepsilon=\frac{2}{255}$ |
| FARE4 | [Link](https://nc.mlcloud.uni-tuebingen.de/index.php/s/jnQ2qmp9tst8kyQ/download/fare_eps_4.pt) | ours | Unsupervised adversarial fine-tuning with $\ell_\infty$ norm, $\varepsilon=\frac{4}{255}$ |
The models are also available on [HuggingFace](https://huggingface.co/collections/chs20/robust-clip-65d913e552eca001fdc41978).
All models are adversarially fine-tuned for two epochs on ImageNet. TeCoA is trained in a supervised fashion, utilizing ImageNet class labels. FARE, in contrast, does not require any labels for training.
### Loading pretrained models
The provided checkpoints correspond to the vision encoder of CLIP. To load the full CLIP model (including the text encoder), you can use the following code:
```python
import torch
from open_clip import create_model_and_transforms

# load the original OpenAI CLIP ViT-L/14 (vision + text encoder) and its preprocessing
model, _, image_processor = create_model_and_transforms(
    'ViT-L-14', pretrained='openai', device='cpu'
)

# replace the vision encoder weights with the adversarially fine-tuned checkpoint
checkpoint = torch.load('/path/to/fare_eps_2.pt', map_location=torch.device('cpu'))
model.visual.load_state_dict(checkpoint)
```
Alternatively, load the model directly from HuggingFace:
```python
import open_clip

model, _, image_processor = open_clip.create_model_and_transforms('hf-hub:chs20/fare2-clip')
```
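For reference, here is a minimal zero-shot classification sketch with a loaded model; the class prompts and image path are placeholders:
```python
import torch
from PIL import Image
from open_clip import get_tokenizer

tokenizer = get_tokenizer('ViT-L-14')
prompts = ['a photo of a cat', 'a photo of a dog']  # placeholder class prompts
image = image_processor(Image.open('/path/to/image.jpg')).unsqueeze(0)  # placeholder path

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(tokenizer(prompts))
    # cosine similarities between the image and each prompt, turned into probabilities
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(probs)
```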
### Summary of results
We show a summary of results on zero-shot classification and vision-language tasks for the original and fine-tuned ViT-L/14 CLIP models. *CLIP-only* means that we evaluate
the respective CLIP model in a standalone fashion for zero-shot classification, whereas for *OpenFlamingo* and *LLaVA* we use the respective CLIP model
as the vision encoder of these large vision-language models. Results for individual zero-shot datasets and more VLM tasks
are provided in the paper.
- Clean evaluation:

  | Model  | CLIP-only<br/>Avg. zero-shot | OpenFlamingo 9B<br/>COCO | OpenFlamingo 9B<br/>TextVQA | LLaVA 1.5 7B<br/>COCO | LLaVA 1.5 7B<br/>TextVQA |
  |--------|------|------|------|-------|------|
  | OpenAI | 73.1 | 79.7 | 23.8 | 115.5 | 37.1 |
  | TeCoA2 | 60.0 | 73.5 | 16.6 | 98.4  | 24.1 |
  | FARE2  | 67.0 | 79.1 | 21.6 | 109.9 | 31.9 |
  | TeCoA4 | 54.2 | 66.9 | 15.4 | 88.3  | 20.7 |
  | FARE4  | 61.1 | 74.1 | 18.6 | 102.4 | 27.6 |
- Adversarial evaluation ($\ell_\infty, ~ \varepsilon=\frac{2}{255}$):

  | Model  | CLIP-only<br/>Avg. zero-shot | OpenFlamingo 9B<br/>COCO | OpenFlamingo 9B<br/>TextVQA | LLaVA 1.5 7B<br/>COCO | LLaVA 1.5 7B<br/>TextVQA |
  |--------|------|------|------|------|------|
  | OpenAI | 0.0  | 1.5  | 0.0  | 4.0  | 0.5  |
  | TeCoA2 | 43.6 | 31.6 | 3.5  | 44.2 | 12.1 |
  | FARE2  | 43.1 | 34.2 | 4.1  | 53.6 | 14.7 |
  | TeCoA4 | 42.3 | 28.5 | 2.1  | 50.9 | 12.6 |
  | FARE4  | 45.9 | 30.9 | 3.4  | 57.1 | 15.8 |
- Adversarial evaluation ($\ell_\infty, ~ \varepsilon=\frac{4}{255}$):

  | Model  | CLIP-only<br/>Avg. zero-shot | OpenFlamingo 9B<br/>COCO | OpenFlamingo 9B<br/>TextVQA | LLaVA 1.5 7B<br/>COCO | LLaVA 1.5 7B<br/>TextVQA |
  |--------|------|------|------|------|------|
  | OpenAI | 0.0  | 1.1  | 0.0  | 3.1  | 0.0  |
  | TeCoA2 | 27.0 | 21.2 | 2.1  | 30.3 | 8.8  |
  | FARE2  | 20.5 | 19.5 | 1.9  | 31.0 | 9.1  |
  | TeCoA4 | 31.9 | 21.6 | 1.8  | 35.3 | 9.3  |
  | FARE4  | 32.4 | 22.8 | 2.9  | 40.9 | 10.9 |
## Training
- TeCoA4
```shell
python -m train.adversarial_training_clip --clip_model_name ViT-L-14 --pretrained openai --dataset imagenet --imagenet_root /path/to/imagenet --template std --output_normalize True --steps 20000 --warmup 1400 --batch_size 128 --loss ce --opt adamw --lr 1e-5 --wd 1e-4 --attack pgd --inner_loss ce --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1 --wandb False --output_dir /path/to/out/dir --experiment_name TECOA4 --log_freq 10 --eval_freq 10
```
- FARE4
```shell
python -m train.adversarial_training_clip --clip_model_name ViT-L-14 --pretrained openai --dataset imagenet --imagenet_root /path/to/imagenet --template std --output_normalize False --steps 20000 --warmup 1400 --batch_size 128 --loss l2 --opt adamw --lr 1e-5 --wd 1e-4 --attack pgd --inner_loss l2 --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1 --wandb False --output_dir /path/to/out/dir --experiment_name FARE4 --log_freq 10 --eval_freq 10
```
Set `--eps 2` to obtain TeCoA2 and FARE2 models.
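For intuition, the sketch below illustrates the unsupervised FARE objective: the fine-tuned vision encoder is trained so that the embeddings of adversarially perturbed images stay close (in $\ell_2$ distance) to the embeddings the original, frozen CLIP encoder assigns to the clean images. This is only a minimal illustration, not the exact implementation in `train.adversarial_training_clip`, and the PGD inner loop (which maximizes the same distance) is omitted.
```python
import copy
import torch
import torch.nn.functional as F

# keep a frozen copy of the original CLIP vision encoder as the embedding target
# (assumes `model` is an open_clip CLIP model as loaded above)
frozen_encoder = copy.deepcopy(model.visual).eval().requires_grad_(False)

def fare_loss(vision_encoder, images_clean, images_adv):
    """L2 distance between embeddings of perturbed images (trained encoder)
    and clean-image embeddings of the frozen original encoder."""
    with torch.no_grad():
        target = frozen_encoder(images_clean)
    emb_adv = vision_encoder(images_adv)
    return F.mse_loss(emb_adv, target)
```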
## Evaluation
Make sure the files in the `bash` directory are executable: `chmod +x bash/*`
### CLIP ImageNet
```shell
python -m CLIP_eval.clip_robustbench --clip_model_name ViT-L-14 --pretrained /path/to/ckpt.pt --dataset imagenet --imagenet_root /path/to/imagenet --wandb False --norm linf --eps 2
```
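For a quick standalone check, the sketch below wraps a CLIP model as a zero-shot classifier and runs [AutoAttack](https://github.com/fra31/auto-attack) on it. The prompt template and class-name list are placeholders, inputs are assumed to be in $[0, 1]$ with CLIP's input normalization folded into the model, and this is not the exact setup of `CLIP_eval.clip_robustbench`.
```python
import torch
import torch.nn as nn
from autoattack import AutoAttack
from open_clip import get_tokenizer

class ZeroShotCLIP(nn.Module):
    """CLIP as a classifier: logits are cosine similarities to fixed text embeddings."""
    def __init__(self, clip_model, class_names, template='a photo of a {}'):
        super().__init__()
        self.visual = clip_model.visual
        tokenizer = get_tokenizer('ViT-L-14')
        with torch.no_grad():
            text_emb = clip_model.encode_text(tokenizer([template.format(c) for c in class_names]))
        self.register_buffer('text_emb', text_emb / text_emb.norm(dim=-1, keepdim=True))

    def forward(self, x):
        img_emb = self.visual(x)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        return img_emb @ self.text_emb.T

# classifier = ZeroShotCLIP(model, imagenet_class_names).eval().cuda()
# adversary = AutoAttack(classifier, norm='Linf', eps=2/255, version='standard')
# x_adv = adversary.run_standard_evaluation(images, labels, bs=64)
```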
### CLIP Zero-Shot
Set models to be evaluated in `CLIP_benchmark/benchmark/models.txt` and datasets in `CLIP_benchmark/benchmark/datasets.txt`
(the datasets are downloaded from HuggingFace). Then run
```shell
cd CLIP_benchmark
./bash/run_benchmark_adv.sh
```
### VLM Captioning and VQA
#### LLaVA
In `bash/llava_eval.sh`, supply the paths to the datasets. The required annotation files can be obtained from this [HuggingFace repository](https://huggingface.co/datasets/openflamingo/eval_benchmark/tree/main).
Set `--vision_encoder_pretrained` to `openai` or supply the path to a fine-tuned CLIP model checkpoint.
Then run
```shell
./bash/llava_eval.sh
```
The LLaVA model will be automatically downloaded from HuggingFace.
#### OpenFlamingo
Download the OpenFlamingo 9B [model](https://huggingface.co/openflamingo/OpenFlamingo-9B-vitl-mpt7b/tree/main), supply the paths in `bash/of_eval_9B.sh`, and run
```shell
./bash/of_eval_9B.sh
```
Some non-standard annotation files are supplied [here](https://nc.mlcloud.uni-tuebingen.de/index.php/s/mtRnQFaZJkR9zaX) and [here](https://github.com/mlfoundations/open_flamingo/tree/main/open_flamingo/eval/data).
### VLM Stealthy Targeted Attacks
For targeted attacks on COCO, run
```shell
./bash/llava_eval_targeted.sh
```
For targeted attacks on self-selected images, set images and target captions in `vlm_eval/run_evaluation_qualitative.py` and run
```shell
python -m vlm_eval.run_evaluation_qualitative --precision float32 --attack apgd --eps 2 --steps 10000 --vlm_model_name llava --vision_encoder_pretrained openai --verbose
```
With 10,000 iterations it takes about 2 hours per image on an A100 GPU.
### POPE
```shell
./bash/eval_pope.sh openai  # clean model evaluation
./bash/eval_pope.sh  # robust model evaluation - set the checkpoint path in the script
```
### SQA
```shell
./bash/eval_scienceqa.sh openai  # clean model evaluation
./bash/eval_scienceqa.sh  # robust model evaluation - set the checkpoint path in the script
```
## Acknowledgements
This repository gratefully builds on code from
- [OpenFlamingo](https://github.com/mlfoundations/open_flamingo)
- [LLaVA](https://github.com/haotian-liu/LLaVA)
- [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark)
- [AutoAttack](https://github.com/fra31/auto-attack)
## Citation
If you find this repository useful, please consider citing our paper:
```bibtex
@inproceedings{schlarmann2024robustclip,
    title={Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models},
    author={Christian Schlarmann and Naman Deep Singh and Francesco Croce and Matthias Hein},
    year={2024},
    booktitle={ICML}
}
```