# RobustVLM (Foundation Models) via Object-centric Learning
## Installation
Create and activate the conda environment:

```bash
conda create -n robustclip python==3.11
conda activate robustclip
```
The code is tested with Python 3.11. To install the required packages, run:

```bash
pip install -r requirements.txt
```
To install `open_clip_torch` locally, run:

```bash
cd ./open_clip_torch
python setup.py develop
```
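To confirm that the local `open_clip_torch` install is the one being picked up, a quick check like the following can help. This is a minimal sketch; the model and pretrained names match the training commands below, and the `openai` weights are downloaded on first use.

```python
# Sanity check (optional): the locally installed open_clip should be importable
# and able to build the ViT-L-14 backbone used by the commands in this README.
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")
print(f"Loaded ViT-L-14 with {sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```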
## Stage 1: Get Object-centric Models
### Dataset
Prepare the ImageNet dataset in a `torchvision.datasets.ImageFolder`-style layout:
```
dataset_path
└─ imagenet
   ├─ train
   │   ├─ n01440764
   │   │   ├─ xxxxxx.JPEG
   │   │   └─ ...
   │   └─ ...
   └─ val
       ├─ n04254680
       │   ├─ xxxxxx.JPEG
       │   └─ ...
       └─ ...
```
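To check that the layout is what the training scripts expect, the split directories should load directly with `torchvision.datasets.ImageFolder`. A minimal sketch (adjust `root` to your `--imagenet_root`; the transform is illustrative):

```python
# Verify the ImageNet folder layout loads as an ImageFolder-style dataset.
from torchvision import datasets, transforms

root = "/path/to/dataset_path/imagenet"  # same path as --imagenet_root below
tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder(f"{root}/train", transform=tf)
val_set = datasets.ImageFolder(f"{root}/val", transform=tf)
print(len(train_set), "train images /", len(val_set), "val images /", len(train_set.classes), "classes")
```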
### Training
- Slot-Attention on 4 GPUs

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m train.training_clip_slots --clip_model_name ViT-L-14 --pretrained openai --dataset imagenet --imagenet_root /.../.../dataset_path/imagenet --template std --output_normalize False --steps 1000000 --warmup 10000 --batch_size 128 --loss l2 --opt adamw --lr 5e-5 --wd 1e-4 --attack pgd --inner_loss l2 --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1 --wandb False --output_dir ./output_slots --experiment_name SLOTS --log_freq 1000 --eval_freq 1000
```
The reconstruction results after Slot-Attention and the checkpoints are stored in `./output_slots/ViT-L-14_openai_imagenet_l2_imagenet_SLOTS_xxxxx`.
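For reference, the attack flags above (`--attack pgd --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1`) describe an L-infinity PGD attack on the image encoder. The sketch below shows the corresponding update rule, assuming `--eps` and `--stepsize_adv` are given in units of 1/255 and that the inner `l2` loss is an embedding-space distance; the repo's own attack implementation may differ in details.

```python
# Minimal L-inf PGD sketch for --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1
# (eps/stepsize assumed to be in 1/255 units). `embed` stands in for the CLIP image
# encoder and `clean_emb` for the clean-image embedding targeted by the l2 inner loss.
import torch

def pgd_linf(embed, x, clean_emb, eps=4 / 255, step=1 / 255, iters=10):
    delta = torch.zeros_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(iters):
        loss = ((embed(x + delta) - clean_emb) ** 2).sum()   # l2 inner loss
        grad = torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():
            delta += step * grad.sign()                      # ascend: maximize the loss
            delta.clamp_(-eps, eps)                          # stay inside the eps-ball
            delta.copy_((x + delta).clamp(0, 1) - x)         # keep a valid image in [0, 1]
    return (x + delta).detach()
```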
## Stage 2: Training and Evaluation with Object-centric Representations
- SlotVLM4

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m train.adversarial_training_clip_with_object_token --clip_model_name ViT-L-14 --slots_ckp ./ckps/model_slots_step_300000.pt --pretrained openai --dataset imagenet --imagenet_root /path/to/imagenet --template std --output_normalize False --steps 20000 --warmup 1400 --batch_size 128 --loss l2 --opt adamw --lr 1e-5 --wd 1e-4 --attack pgd --inner_loss l2 --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1 --wandb False --output_dir ./output --experiment_name with_OT --log_freq 10 --eval_freq 10
```
Set `--eps 2` to obtain SlotVLM2 models.
To resume training, add parameters such as `--optimizer_state /xxx/checkpoints/fallback_80000_opt.pt --start_step 80000 --pretrained none`, e.g.:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m train.adversarial_training_clip_with_object_token --clip_model_name ViT-L-14 --slots_ckp ./ckps/model_slots_step_300000.pt --dataset imagenet --imagenet_root /path/to/imagenet --template std --output_normalize False --steps 20000 --warmup 1400 --batch_size 128 --loss l2 --opt adamw --lr 1e-5 --wd 1e-4 --attack pgd --inner_loss l2 --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1 --wandb False --output_dir ./output --experiment_name with_OT --log_freq 10 --eval_freq 10 --optimizer_state /home/xxx/RobustVLM/output/ViT-L-14_openai_imagenet_l2_imagenet_with_Object_Token_xxxxx/checkpoints/fallback_80000_opt.pt --start_step 80000 --pretrained none
```
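Conceptually, the resume flags restore both the model weights and the AdamW optimizer state so that momentum buffers and the learning-rate schedule continue from the given step. A rough sketch of what this amounts to (file names follow the `fallback_<step>` pattern above; whether the model file holds the full CLIP model or only the vision encoder is an assumption, and the repo's own checkpoint-loading code is authoritative):

```python
# Rough sketch of what --optimizer_state / --start_step imply; not the repo's loader.
import torch

def resume_from(model, optimizer, ckpt_dir, step):
    model.load_state_dict(torch.load(f"{ckpt_dir}/fallback_{step}.pt", map_location="cpu"))
    optimizer.load_state_dict(torch.load(f"{ckpt_dir}/fallback_{step}_opt.pt", map_location="cpu"))
    return step  # pass as --start_step so warmup/decay resume at the right point
```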
## Evaluation
Make sure the files in the `bash` directory are executable:

```bash
chmod +x bash/*
```
### CLIP ImageNet
```bash
python -m CLIP_eval.clip_robustbench --clip_model_name ViT-L-14 --pretrained /path/to/ckpt.pt --dataset imagenet --imagenet_root /path/to/imagenet --wandb False --norm linf --eps 2
```
Note: set `--pretrained` to the path of the checkpoint you want to evaluate, and use `--eps 2` or `--eps 4` for SlotVLM2 or SlotVLM4 models, respectively.
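Under the hood, this kind of evaluation treats the (possibly fine-tuned) CLIP image encoder as a zero-shot ImageNet classifier and attacks it at the given L-infinity budget. A hedged sketch using the `autoattack` package (class-prompt construction, the eps units, and whether input normalization sits inside `encode_image` are assumptions; `CLIP_eval.clip_robustbench` is the authoritative implementation):

```python
# Hedged sketch: wrap CLIP as a zero-shot classifier and attack it with AutoAttack.
import torch
from autoattack import AutoAttack

class ZeroShotCLIP(torch.nn.Module):
    def __init__(self, clip_model, text_weights):
        super().__init__()
        self.clip_model = clip_model
        self.text_weights = text_weights  # (embed_dim, num_classes), L2-normalized

    def forward(self, x):
        feats = self.clip_model.encode_image(x)            # input normalization assumed inside
        feats = feats / feats.norm(dim=-1, keepdim=True)
        return feats @ self.text_weights                   # cosine-similarity logits

# classifier = ZeroShotCLIP(clip_model, text_weights).eval().cuda()
# adversary = AutoAttack(classifier, norm="Linf", eps=2 / 255)   # --eps 2, assuming 1/255 units
# x_adv = adversary.run_standard_evaluation(images, labels, bs=32)
```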
### CLIP Zero-Shot
Set the models to be evaluated in `CLIP_benchmark/benchmark/models.txt` and the datasets in `CLIP_benchmark/benchmark/datasets.txt` (the datasets are downloaded from HuggingFace). Then run:
```bash
cd CLIP_benchmark
./bash/run_benchmark_adv.sh
```
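The zero-shot classification part of these benchmarks boils down to the standard CLIP recipe: class names are turned into text prompts, embedded once, and images are classified by cosine similarity. A minimal sketch with `open_clip` (model name, prompt template, and class names are placeholders; the benchmark reads them from `models.txt` / `datasets.txt`):

```python
# Minimal zero-shot classification sketch illustrating what the benchmark measures.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")
classnames = ["cat", "dog"]  # placeholder class names

with torch.no_grad():
    text = tokenizer([f"a photo of a {c}" for c in classnames])
    text_feats = model.encode_text(text)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    # img = preprocess(Image.open("example.jpg")).unsqueeze(0)
    # img_feats = model.encode_image(img)
    # img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    # prediction = classnames[(img_feats @ text_feats.T).argmax().item()]
```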
### VLM Captioning and VQA
#### LLaVA
In `/bash/llava_eval.sh`, supply the paths to the datasets. The required annotation files can be obtained from this HuggingFace repository. Set `--vision_encoder_pretrained` to `openai` or supply the path to a fine-tuned CLIP model checkpoint. Then run:

```bash
./bash/llava_eval.sh
```
The LLaVA model will be automatically downloaded from HuggingFace.
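For a quick standalone look at what is being evaluated (caption generation from an image), the Hugging Face `transformers` port of LLaVA-1.5 can be used as below. This is only an illustration and not necessarily the loader used by the evaluation scripts; the image path is a placeholder.

```python
# Illustrative only: caption an image with the transformers port of LLaVA-1.5.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")

image = Image.open("example.jpg")
prompt = "USER: <image>\nDescribe the image briefly. ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(output[0], skip_special_tokens=True))
```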
#### OpenFlamingo
Download the OpenFlamingo 9B model, supply the paths in `/bash/of_eval_9B.sh`, and run:

```bash
./bash/of_eval_9B.sh
```
Some non-standard annotation files are supplied here and here.
### VLM Stealthy Targeted Attacks
For targeted attacks on COCO, run:

```bash
./bash/llava_eval_targeted.sh
```
For targeted attacks on self-selected images, set the images and target captions in `vlm_eval/run_evaluation_qualitative.py` and run:

```bash
python -m vlm_eval.run_evaluation_qualitative --precision float32 --attack apgd --eps 2 --steps 10000 --vlm_model_name llava --vision_encoder_pretrained openai --verbose
```
With 10,000 iterations it takes about 2 hours per image on an A100 GPU.
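Conceptually, the targeted attack searches for an L-infinity bounded perturbation that minimizes the VLM's loss on the chosen target caption, steering the model toward emitting that caption. A rough sketch (the real script uses APGD over many iterations; `caption_nll` is a placeholder for the teacher-forced negative log-likelihood of the target caption, and the eps/step units are assumptions):

```python
# Conceptual targeted-attack sketch: minimize the loss of the target caption
# under an L-inf constraint. Not the repo's APGD implementation.
import torch

def targeted_pgd(caption_nll, image, target_caption, eps=2 / 255, step=0.5 / 255, iters=10000):
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(iters):
        loss = caption_nll(image + delta, target_caption)   # lower = closer to target caption
        grad = torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():
            delta -= step * grad.sign()                      # descend: targeted objective
            delta.clamp_(-eps, eps)
            delta.copy_((image + delta).clamp(0, 1) - image)
    return (image + delta).detach()
```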
### POPE
```bash
./bash/eval_pope.sh openai   # for clean model evaluation
./bash/eval_pope.sh          # for robust model evaluation - set path_to_ckpt in the bash file
```
### SQA
```bash
./bash/eval_scienceqa.sh openai   # for clean model evaluation
./bash/eval_scienceqa.sh          # for robust model evaluation - set path_to_ckpt in the bash file
```