# RobustVLM (Foundation Models) via Object-centric Learning

## Table of Contents
- [Installation](#installation)
- [Stage 1: Get Object-centric Models](#stage-1-get-object-centric-models)
- [Dataset](#dataset)
- [Training](#training)

## Installation
Create and activate the anaconda environment:
```shell
conda create -n robustclip python=3.11
```
```shell
conda activate robustclip
```
The code is tested with Python 3.11. To install the required packages, run:
```shell
pip install -r requirements.txt
```
To install open_clip_torch locally, run:
```shell
cd ./open_clip_torch
```
```shell
python setup.py develop
```

## Stage 1: Get Object-centric Models

### Dataset
Prepare the ImageNet dataset in a torch `ImageFolder`-style format:
```
dataset_path
└─imagenet
    └─train
        └─n01440764
            xxxxxx.JPEG
            .....
        └─......
    └─val
        └─n04254680
            xxxxxx.JPEG
            .....
        └─......
```

### Training
- Slot-Attention on 4 GPUs
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m train.training_clip_slots --clip_model_name ViT-L-14 --pretrained openai --dataset imagenet --imagenet_root /.../.../dataset_path/imagenet --template std --output_normalize False --steps 1000000 --warmup 10000 --batch_size 128 --loss l2 --opt adamw --lr 5e-5 --wd 1e-4 --attack pgd --inner_loss l2 --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1 --wandb False --output_dir ./output_slots --experiment_name SLOTS --log_freq 1000 --eval_freq 1000
```
The reconstruction results after slot attention and the checkpoints are stored in `./output_slots/ViT-L-14_openai_imagenet_l2_imagenet_SLOTS_xxxxx`.
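Before launching training, it can help to confirm that the ImageNet folders above are laid out the way `torchvision.datasets.ImageFolder` expects. The snippet below is a minimal sketch, not part of the training pipeline; the dataset path is a placeholder, and the transform is only illustrative.

```python
# Minimal sketch: verify the ImageFolder-style layout before training.
# The dataset path is a placeholder; point it at your own dataset_path/imagenet.
import torch
from torchvision import datasets, transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder("/path/to/dataset_path/imagenet/train", transform=preprocess)
val_set = datasets.ImageFolder("/path/to/dataset_path/imagenet/val", transform=preprocess)

print(f"train: {len(train_set)} images, {len(train_set.classes)} classes")
print(f"val:   {len(val_set)} images, {len(val_set.classes)} classes")

# A quick batch check confirms that images decode and collate correctly.
loader = torch.utils.data.DataLoader(train_set, batch_size=8, num_workers=2)
images, labels = next(iter(loader))
print(images.shape, labels.shape)  # e.g. torch.Size([8, 3, 224, 224]) torch.Size([8])
```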
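To work with the saved outputs afterwards, the sketch below shows one way to inspect a checkpoint from the output directory. The checkpoint filename and its internal key layout are assumptions (they depend on how `train.training_clip_slots` saves its state), so inspect the file before loading anything into a model.

```python
# Hedged sketch: inspect a checkpoint written to ./output_slots.
# The filename "checkpoint.pt" and the key layout are assumptions, not the
# repository's documented format; adjust them to what the training script saves.
import torch
import open_clip

ckpt_path = "./output_slots/ViT-L-14_openai_imagenet_l2_imagenet_SLOTS_xxxxx/checkpoint.pt"  # hypothetical name
ckpt = torch.load(ckpt_path, map_location="cpu")

# Print the top-level structure to see whether it is a bare state_dict
# or a wrapped dict (e.g. with optimizer state and step counters).
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])

# If the checkpoint turns out to hold a plain state_dict for the CLIP vision
# tower, it could be loaded into an open_clip ViT-L-14 roughly like this:
model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
# model.visual.load_state_dict(ckpt)  # uncomment only once the key layout is confirmed
```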