---
pipeline_tag: image-text-to-text
library_name: transformers
license: mit
---
# DiffCLIP: Differential Attention Meets CLIP

This repository contains the DiffCLIP model presented in the paper *DiffCLIP: Differential Attention Meets CLIP*.

- **Project Page:** https://hammoudhasan.github.io/DiffCLIP
- **Code:** https://github.com/hammoudhasan/DiffCLIP
## How to Use

### Installation
```bash
# Clone the repository
git clone https://github.com/hammoudhasan/DiffCLIP.git
cd DiffCLIP

# Install dependencies
pip install -r requirements.txt
```
### Basic Usage
```python
import torch
from diff_clip import DiffCLIP_VITB16

# Create model
model = DiffCLIP_VITB16()

# Process image and text
image = torch.randn(1, 3, 224, 224)
text = torch.randint(0, 49408, (1, 77))  # Tokenized text

# Get embeddings
with torch.no_grad():
    outputs = model(image, text)

print(outputs["image_embed"].shape)  # Should be [1, 512]
print(outputs["text_embed"].shape)   # Should be [1, 512]
```
### Zero-Shot Classification
You can use the provided `test_models.py` script to perform zero-shot classification. See the [GitHub README](https://github.com/hammoudhasan/DiffCLIP) for details.
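If you prefer to run zero-shot classification inline rather than through the script, the sketch below illustrates the usual CLIP-style recipe: embed one prompt per class, embed the image, and take a softmax over cosine similarities. It assumes a CLIP-compatible BPE tokenizer (here `clip.tokenize` from OpenAI's `CLIP` package, matching the 77-token context and 49,408-token vocabulary shown above) and that the model encodes image and text independently, so the two batch sizes need not match; the class names are hypothetical, and the tokenizer DiffCLIP actually expects is documented in the GitHub README.

```python
import clip  # Assumed tokenizer: pip install git+https://github.com/openai/CLIP.git
import torch
import torch.nn.functional as F

from diff_clip import DiffCLIP_VITB16

model = DiffCLIP_VITB16()
model.eval()

# One prompt per candidate class (hypothetical labels for illustration).
class_names = ["cat", "dog", "car"]
prompts = [f"a photo of a {name}" for name in class_names]
text_tokens = clip.tokenize(prompts)  # [num_classes, 77]

image = torch.randn(1, 3, 224, 224)  # Replace with a preprocessed image tensor

with torch.no_grad():
    # Assumes image and text are encoded independently, so batch sizes may differ.
    outputs = model(image, text_tokens)

# Cosine similarities between the image and each class prompt, softmaxed into probabilities.
image_embed = F.normalize(outputs["image_embed"], dim=-1)
text_embed = F.normalize(outputs["text_embed"], dim=-1)
probs = (100.0 * image_embed @ text_embed.T).softmax(dim=-1)

for name, p in zip(class_names, probs[0].tolist()):
    print(f"{name}: {p:.3f}")
```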