Model card for CLIP-KD ViT-T-16 distilled from CLIP-ViT-B-16, pretrained on CC3M and CC12M

Github source: https://github.com/winycg/CLIP-KD

Source weights: ViT_T_16_cc3m_12m_ep32.pt

The weights of this model were converted from the open_clip format to be compatible with the Hugging Face CLIP implementation in the transformers library.

Model Details

Model Description

A CLIP ViT-T/16 model pretrained on CC3M and CC12M (https://github.com/google-research-datasets/conceptual-12m) using OpenCLIP (https://github.com/mlfoundations/open_clip).

Uses

The model weights can be loaded with both the open_clip library and the transformers CLIP implementation (tested with transformers 4.44.0). As a CLIP-based model, it is typically used for tasks such as zero-shot image classification, text-image retrieval, and more.

Using open_clip

import torch
import requests
from PIL import Image
import open_clip

model_name = "romrawinjp/clip-kd_ViT-T-16_Baseline-CC3M12M"
model, preprocess = open_clip.create_model_from_pretrained('hf-hub:'+model_name)
tokenizer = open_clip.get_tokenizer('hf-hub:'+model_name)

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image = preprocess(image).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs) 

Using transformers library

import requests
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model_name = "romrawinjp/clip-kd_ViT-T-16_Baseline-CC3M12M"
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
text_labels = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]

inputs = processor(text=text_labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # This is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # Convert logits to probabilities
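
To turn these probabilities into a prediction, the highest-scoring entry can be mapped back to text_labels. The lines below are a minimal sketch along those lines (not part of the original example):

# Minimal sketch (not from the original card): report the top-scoring label.
top_idx = probs.argmax(dim=1).item()  # index of the best-matching text prompt
print(f"Predicted label: {text_labels[top_idx]} ({probs[0, top_idx].item():.3f})")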

Reference

Please refer to the original work.

@inproceedings{yang2024clip,
  title={CLIP-KD: An Empirical Study of CLIP Model Distillation},
  author={Yang, Chuanguang and An, Zhulin and Huang, Libo and Bi, Junyu and Yu, Xinqiang and Yang, Han and Diao, Boyu and Xu, Yongjun},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}
Model size: 46.1M parameters (F32, safetensors format)