Model Details

This CLIP model was initialized from the openai/clip-vit-base-patch32 checkpoint, which was developed to study what contributes to robustness in computer vision tasks.

The model can generalize to arbitrary image classification tasks in a zero-shot manner.

Top predictions:

    Saree: 64.89%
    Dupatta: 25.81%
    Lehenga: 7.51%
    Leggings and Salwar: 0.84%
    Women Kurta: 0.44%
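
These scores come from zero-shot classification: each candidate garment name is turned into a text prompt and ranked by its similarity to the image. The snippet below is a minimal sketch of how such a ranking could be reproduced; the label list and the local file name garment.jpg are illustrative assumptions, not part of the released model.

from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("samim2024/clip")
processor = CLIPProcessor.from_pretrained("samim2024/clip")

# Illustrative candidate classes; adjust to your own label set.
labels = ["Saree", "Dupatta", "Lehenga", "Leggings and Salwar", "Women Kurta"]
texts = [f"a photo of a {label}" for label in labels]

image = Image.open("garment.jpg")  # assumed local image file

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)[0]  # one probability per label

# Print labels ranked by probability, as in the list above.
for p, label in sorted(zip(probs.tolist(), labels), reverse=True):
    print(f"{label}: {p:.2%}")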


Use with Transformers

from PIL import Image
import requests

from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("samim2024/clip")
processor = CLIPProcessor.from_pretrained("samim2024/clip")

url = "https://www.istockphoto.com/photo/indian-saris-gm93355119-10451468"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of a saree", "a photo of a blouse"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
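
To inspect the result, the probabilities can be printed next to their prompts. A small sketch, reusing the texts passed to the processor above:

texts = ["a photo of a saree", "a photo of a blouse"]
for text, p in zip(texts, probs[0].tolist()):
    print(f"{text}: {p:.2%}")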
Model size: 151M params
Tensor type: F32
Format: Safetensors