# RobustVLM (Foundation Models) via Object-centric Learning
## Installation
Create and activate the conda environment:

```bash
conda create -n robustclip python==3.11
conda activate robustclip
```
The code is tested with Python 3.11. To install the required packages, run:

```bash
pip install -r requirements.txt
```
To install `open_clip_torch` locally, run:

```bash
cd ./open_clip_torch
python setup.py develop
```
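To confirm that the local `open_clip_torch` install is the one being picked up, a quick check like the following can help. This is a minimal sketch; the model and pretrained names match the training commands below, and the `openai` weights are downloaded on first use.

```python
# Sanity check (optional): the locally installed open_clip should be importable
# and able to build the ViT-L-14 backbone used by the commands in this README.
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")
print(f"Loaded ViT-L-14 with {sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```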
## Stage 1: Get Object-centric Models
### Dataset
Prepare the ImageNet dataset in a `torchvision.datasets.ImageFolder`-style layout:
```
dataset_path
└─ imagenet
   ├─ train
   │   ├─ n01440764
   │   │   ├─ xxxxxx.JPEG
   │   │   └─ ...
   │   └─ ...
   └─ val
       ├─ n04254680
       │   ├─ xxxxxx.JPEG
       │   └─ ...
       └─ ...
```
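To check that the layout is what the training scripts expect, the split directories should load directly with `torchvision.datasets.ImageFolder`. A minimal sketch (adjust `root` to your `--imagenet_root`; the transform is illustrative):

```python
# Verify the ImageNet folder layout loads as an ImageFolder-style dataset.
from torchvision import datasets, transforms

root = "/path/to/dataset_path/imagenet"  # same path as --imagenet_root below
tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder(f"{root}/train", transform=tf)
val_set = datasets.ImageFolder(f"{root}/val", transform=tf)
print(len(train_set), "train images /", len(val_set), "val images /", len(train_set.classes), "classes")
```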
### Training
- Slot-Attention on 4 GPUs

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m train.training_clip_slots --clip_model_name ViT-L-14 --pretrained openai --dataset imagenet --imagenet_root /.../.../dataset_path/imagenet --template std --output_normalize False --steps 1000000 --warmup 10000 --batch_size 128 --loss l2 --opt adamw --lr 5e-5 --wd 1e-4 --attack pgd --inner_loss l2 --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1 --wandb False --output_dir ./output_slots --experiment_name SLOTS --log_freq 1000 --eval_freq 1000
```
The reconstruction results after Slot-Attention and the checkpoints are stored in `./output_slots/ViT-L-14_openai_imagenet_l2_imagenet_SLOTS_xxxxx`.
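For reference, the attack flags above (`--attack pgd --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1`) describe an L-infinity PGD attack on the image encoder. The sketch below shows the corresponding update rule, assuming `--eps` and `--stepsize_adv` are given in units of 1/255 and that the inner `l2` loss is an embedding-space distance; the repo's own attack implementation may differ in details.

```python
# Minimal L-inf PGD sketch for --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1
# (eps/stepsize assumed to be in 1/255 units). `embed` stands in for the CLIP image
# encoder and `clean_emb` for the clean-image embedding targeted by the l2 inner loss.
import torch

def pgd_linf(embed, x, clean_emb, eps=4 / 255, step=1 / 255, iters=10):
    delta = torch.zeros_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(iters):
        loss = ((embed(x + delta) - clean_emb) ** 2).sum()   # l2 inner loss
        grad = torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():
            delta += step * grad.sign()                      # ascend: maximize the loss
            delta.clamp_(-eps, eps)                          # stay inside the eps-ball
            delta.copy_((x + delta).clamp(0, 1) - x)         # keep a valid image in [0, 1]
    return (x + delta).detach()
```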
## Stage 2: Training and Evaluation with Object-centric Representations
- SlotVLM4

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m train.adversarial_training_clip_with_object_token --clip_model_name ViT-L-14 --slots_ckp ./ckps/model_slots_step_300000.pt --pretrained openai --dataset imagenet --imagenet_root /path/to/imagenet --template std --output_normalize False --steps 20000 --warmup 1400 --batch_size 128 --loss l2 --opt adamw --lr 1e-5 --wd 1e-4 --attack pgd --inner_loss l2 --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1 --wandb False --output_dir ./output --experiment_name with_OT --log_freq 10 --eval_freq 10
```
Set `--eps 2` to obtain SlotVLM2 models.
To resume training, add parameters such as `--optimizer_state /xxx/checkpoints/fallback_80000_opt.pt --start_step 80000 --pretrained none`, e.g.:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m train.adversarial_training_clip_with_object_token --clip_model_name ViT-L-14 --slots_ckp ./ckps/model_slots_step_300000.pt --dataset imagenet --imagenet_root /path/to/imagenet --template std --output_normalize False --steps 20000 --warmup 1400 --batch_size 128 --loss l2 --opt adamw --lr 1e-5 --wd 1e-4 --attack pgd --inner_loss l2 --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1 --wandb False --output_dir ./output --experiment_name with_OT --log_freq 10 --eval_freq 10 --optimizer_state /home/xxx/RobustVLM/output/ViT-L-14_openai_imagenet_l2_imagenet_with_Object_Token_xxxxx/checkpoints/fallback_80000_opt.pt --start_step 80000 --pretrained none
```
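Conceptually, the resume flags restore both the model weights and the AdamW optimizer state so that momentum buffers and the learning-rate schedule continue from the given step. A rough sketch of what this amounts to (file names follow the `fallback_<step>` pattern above; whether the model file holds the full CLIP model or only the vision encoder is an assumption, and the repo's own checkpoint-loading code is authoritative):

```python
# Rough sketch of what --optimizer_state / --start_step imply; not the repo's loader.
import torch

def resume_from(model, optimizer, ckpt_dir, step):
    model.load_state_dict(torch.load(f"{ckpt_dir}/fallback_{step}.pt", map_location="cpu"))
    optimizer.load_state_dict(torch.load(f"{ckpt_dir}/fallback_{step}_opt.pt", map_location="cpu"))
    return step  # pass as --start_step so warmup/decay resume at the right point
```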
## Evaluation
Make sure the files in the `bash` directory are executable:

```bash
chmod +x bash/*
```
### CLIP ImageNet
```bash
python -m CLIP_eval.clip_robustbench --clip_model_name ViT-L-14 --pretrained /path/to/ckpt.pt --dataset imagenet --imagenet_root /path/to/imagenet --wandb False --norm linf --eps 2
```
Note: set `--pretrained` to the path of the checkpoint you want to evaluate, and use `--eps 2` or `--eps 4` for SlotVLM2 or SlotVLM4 models, respectively.
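Under the hood, this kind of evaluation treats the (possibly fine-tuned) CLIP image encoder as a zero-shot ImageNet classifier and attacks it at the given L-infinity budget. A hedged sketch using the `autoattack` package (class-prompt construction, the eps units, and whether input normalization sits inside `encode_image` are assumptions; `CLIP_eval.clip_robustbench` is the authoritative implementation):

```python
# Hedged sketch: wrap CLIP as a zero-shot classifier and attack it with AutoAttack.
import torch
from autoattack import AutoAttack

class ZeroShotCLIP(torch.nn.Module):
    def __init__(self, clip_model, text_weights):
        super().__init__()
        self.clip_model = clip_model
        self.text_weights = text_weights  # (embed_dim, num_classes), L2-normalized

    def forward(self, x):
        feats = self.clip_model.encode_image(x)            # input normalization assumed inside
        feats = feats / feats.norm(dim=-1, keepdim=True)
        return feats @ self.text_weights                   # cosine-similarity logits

# classifier = ZeroShotCLIP(clip_model, text_weights).eval().cuda()
# adversary = AutoAttack(classifier, norm="Linf", eps=2 / 255)   # --eps 2, assuming 1/255 units
# x_adv = adversary.run_standard_evaluation(images, labels, bs=32)
```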
### CLIP Zero-Shot
Set the models to be evaluated in `CLIP_benchmark/benchmark/models.txt` and the datasets in `CLIP_benchmark/benchmark/datasets.txt` (the datasets are downloaded from HuggingFace). Then run:
```bash
cd CLIP_benchmark
./bash/run_benchmark_adv.sh
```
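The zero-shot classification part of these benchmarks boils down to the standard CLIP recipe: class names are turned into text prompts, embedded once, and images are classified by cosine similarity. A minimal sketch with `open_clip` (model name, prompt template, and class names are placeholders; the benchmark reads them from `models.txt` / `datasets.txt`):

```python
# Minimal zero-shot classification sketch illustrating what the benchmark measures.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")
classnames = ["cat", "dog"]  # placeholder class names

with torch.no_grad():
    text = tokenizer([f"a photo of a {c}" for c in classnames])
    text_feats = model.encode_text(text)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    # img = preprocess(Image.open("example.jpg")).unsqueeze(0)
    # img_feats = model.encode_image(img)
    # img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    # prediction = classnames[(img_feats @ text_feats.T).argmax().item()]
```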
### VLM Captioning and VQA
#### LLaVA
In `/bash/llava_eval.sh`, supply the paths to the datasets. The required annotation files can be obtained from this HuggingFace repository. Set `--vision_encoder_pretrained` to `openai` or supply the path to a fine-tuned CLIP model checkpoint. Then run:

```bash
./bash/llava_eval.sh
```
The LLaVA model will be automatically downloaded from HuggingFace.
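For a quick standalone look at what is being evaluated (caption generation from an image), the Hugging Face `transformers` port of LLaVA-1.5 can be used as below. This is only an illustration and not necessarily the loader used by the evaluation scripts; the image path is a placeholder.

```python
# Illustrative only: caption an image with the transformers port of LLaVA-1.5.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")

image = Image.open("example.jpg")
prompt = "USER: <image>\nDescribe the image briefly. ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(output[0], skip_special_tokens=True))
```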
#### OpenFlamingo
Download the OpenFlamingo 9B model, supply the paths in `/bash/of_eval_9B.sh`, and run:

```bash
./bash/of_eval_9B.sh
```
Some non-standard annotation files are supplied here and here.
### VLM Stealthy Targeted Attacks
For targeted attacks on COCO, run:

```bash
./bash/llava_eval_targeted.sh
```
For targeted attacks on self-selected images, set the images and target captions in `vlm_eval/run_evaluation_qualitative.py` and run:

```bash
python -m vlm_eval.run_evaluation_qualitative --precision float32 --attack apgd --eps 2 --steps 10000 --vlm_model_name llava --vision_encoder_pretrained openai --verbose
```
With 10,000 iterations it takes about 2 hours per image on an A100 GPU.
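Conceptually, the targeted attack searches for an L-infinity bounded perturbation that minimizes the VLM's loss on the chosen target caption, steering the model toward emitting that caption. A rough sketch (the real script uses APGD over many iterations; `caption_nll` is a placeholder for the teacher-forced negative log-likelihood of the target caption, and the eps/step units are assumptions):

```python
# Conceptual targeted-attack sketch: minimize the loss of the target caption
# under an L-inf constraint. Not the repo's APGD implementation.
import torch

def targeted_pgd(caption_nll, image, target_caption, eps=2 / 255, step=0.5 / 255, iters=10000):
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(iters):
        loss = caption_nll(image + delta, target_caption)   # lower = closer to target caption
        grad = torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():
            delta -= step * grad.sign()                      # descend: targeted objective
            delta.clamp_(-eps, eps)
            delta.copy_((image + delta).clamp(0, 1) - image)
    return (image + delta).detach()
```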
### POPE
```bash
./bash/eval_pope.sh openai   # for clean model evaluation
./bash/eval_pope.sh          # for robust model evaluation - set path_to_ckpt in the bash file
```
### SQA
```bash
./bash/eval_scienceqa.sh openai   # for clean model evaluation
./bash/eval_scienceqa.sh          # for robust model evaluation - set path_to_ckpt in the bash file
```