LensAI Logo

Adversarial CLIP ViT Base Patch32 Fine-Tuned on PatchCamelyon (PCAM)

Overview

This repository contains a CLIP ViT Base Patch32 model fine-tuned on the PatchCamelyon (PCAM) dataset and further trained on an adversarial PCAM dataset. The model is optimized for histopathological image classification.

πŸ“Œ Model Highlights

  • Model Type: CLIP Vision Transformer (ViT-B/32) with classification head
  • Task: Binary classification of histopathological images (cancer vs. non-cancer)
  • Base Model: openai/clip-vit-base-patch32
  • Training Data: PatchCamelyon (PCAM) and Adversarial PCAM datasets
  • Input: RGB images (224x224 pixels)
  • Output: Binary classification (cancer/non-cancer)

πŸš€ Key Results

βœ… Clean Evaluation Metrics

  • Clean Accuracy: 86.72%

βš”οΈ Adversarial Robustness (Fine-tuned Model)

  • PGD Attack:
    • Success Rate: 17.87%
    • Average L2 Distance: 12.09
  • FGSM Attack:
    • Success Rate: 17.38%
    • Average L2 Distance: 12.10
  • DeepFool Attack:
    • Success Rate: 35.62%
    • Average L2 Distance: 234.13
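
For reference, the success rates above measure how often an attack flips the model's prediction. A minimal FGSM evaluation sketch in plain PyTorch is shown below; it is not the exact evaluation script used for these numbers, and it assumes success is counted over samples the model classifies correctly on clean inputs, that pixel values lie in [0, 1], and that the model and data loader follow the Usage section further down (the epsilon value is a placeholder).

import torch
import torch.nn.functional as F

def fgsm_success_rate(model, loader, eps=0.03, device="cpu"):
    """Fraction of correctly classified clean samples whose prediction flips under FGSM."""
    model.eval()
    flipped, clean_correct = 0, 0
    for batch in loader:
        x = batch["pixel_values"].to(device).requires_grad_(True)
        y = batch["labels"].to(device)
        logits = model(x)
        mask = logits.argmax(dim=-1) == y  # only count samples the model gets right on clean data
        grad = torch.autograd.grad(F.cross_entropy(logits, y), x)[0]
        x_adv = (x + eps * grad.sign()).clamp(0.0, 1.0)  # single signed-gradient step, clipped to [0, 1]
        with torch.no_grad():
            adv_pred = model(x_adv).argmax(dim=-1)
        flipped += ((adv_pred != y) & mask).sum().item()
        clean_correct += mask.sum().item()
    return flipped / max(clean_correct, 1)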

πŸ“Š Base Model Comparison

  • Clean Accuracy: 86.30%
  • PGD: 50.10% Success Rate | Avg L2 Distance: 12.08
  • FGSM: 44.14% Success Rate | Avg L2 Distance: 12.10
  • DeepFool: 81.64% Success Rate | Avg L2 Distance: 224.66

Training: 5 epochs on an NVIDIA A100 GPU


πŸ”§ Usage

Installation

pip install transformers torch safetensors datasets numpy pillow
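
The inference example below loads a checkpoint named best_enhanced_pcam_model.pt. Assuming that file is hosted in this model repository (check the repository's file listing for the exact name), it can be fetched with huggingface_hub, which ships as a transformers dependency:

from huggingface_hub import hf_hub_download

# Download the fine-tuned checkpoint from the Hub.
# The filename is taken from the inference example below; adjust it if the
# repository stores the weights under a different name.
checkpoint_path = hf_hub_download(
    repo_id="lens-ai/adversarial-clip-vit-base-patch32_pcam_finetuned",
    filename="best_enhanced_pcam_model.pt",
)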

Inference Example

from transformers import CLIPVisionConfig, CLIPVisionModel
import torch
from torch import nn
from torch.utils.data import Dataset
import numpy as np

class PCamClassifier(nn.Module):
    def __init__(self, config_dict):
        super().__init__()
        self.config = CLIPVisionConfig(**config_dict)
        self.vision_model = CLIPVisionModel(self.config)
        # Binary head on top of the pooled CLIP vision features (non-cancer vs. cancer)
        self.classifier = nn.Linear(self.config.hidden_size, 2)

    def forward(self, pixel_values):
        outputs = self.vision_model(pixel_values)
        return self.classifier(outputs.pooler_output)

# Vision tower configuration (matches openai/clip-vit-base-patch32)
config_dict = {
    "_name_or_path": "openai/clip-vit-base-patch32",
    "architectures": ["CLIPVisionModel"],
    "attention_dropout": 0.0,
    "dropout": 0.0,
    "hidden_act": "quick_gelu",
    "hidden_size": 768,
    "image_size": 224,
    "initializer_factor": 1.0,
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "layer_norm_eps": 1e-05,
    "model_type": "clip_vision_model",
    "num_attention_heads": 12,
    "num_channels": 3,
    "num_hidden_layers": 12,
    "patch_size": 32,
    "projection_dim": 512,
    "torch_dtype": "float32"
}

# Initialize the model and load the fine-tuned weights for inference
model = PCamClassifier(config_dict)
model.load_state_dict(torch.load('best_enhanced_pcam_model.pt', map_location='cpu'))
model.eval()


# Wraps a Hugging Face PCAM split: converts each patch to RGB, scales pixels to [0, 1],
# and returns channels-first float32 arrays suitable for the classifier above.
class PCamDataset(Dataset):
    def __init__(self, dataset):
        self.dataset = dataset
        
    def __len__(self):
        return len(self.dataset)
        
    def __getitem__(self, idx):
        example = self.dataset[idx]
        image = example["image"].convert("RGB")
        image_array = np.array(image) / 255.0
        image_array = image_array.transpose(2, 0, 1).astype(np.float32)
        return {
            "pixel_values": image_array,
            "labels": example["label"]
        }
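
A minimal single-image prediction sketch follows, mirroring the preprocessing in PCamDataset above (resize to 224x224, scale to [0, 1], channels-first, no extra normalization). The file name patch.png is a placeholder, and mapping class index 1 to "cancer" follows the PCAM labeling convention but should be verified against the training labels.

from PIL import Image

# Load one patch, resize to the model's 224x224 input, and scale to [0, 1] channels-first.
image = Image.open("patch.png").convert("RGB").resize((224, 224))
pixel_values = np.array(image, dtype=np.float32).transpose(2, 0, 1) / 255.0  # shape (3, 224, 224)

with torch.no_grad():
    logits = model(torch.from_numpy(pixel_values).unsqueeze(0))  # shape (1, 2)
    prediction = logits.argmax(dim=-1).item()

print("cancer" if prediction == 1 else "non-cancer")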

πŸ“Š Future Work

We plan to release:

  • Enhanced robustness metrics
  • Expanded adversarial attack evaluations

πŸ“œ License

Released under the Apache-2.0 License.

πŸ“¬ Contact

For inquiries, please reach out to Venkata Tej at LensAI.
