Model card for CLIP-KD ViT-T-16 distilled from CLIP-ViT-B-16, pretrained on CC3M and CC12M
GitHub source: https://github.com/winycg/CLIP-KD
Original weight file: ViT_T_16_cc3m_12m_ep32.pt
The weights of this model were converted from the open_clip format to be compatible with the Hugging Face transformers CLIP implementation.
Model Details
Model Description
A CLIP ViT-T/16 model pretrained on the CC3M and CC12M (https://github.com/google-research-datasets/conceptual-12m) datasets using OpenCLIP (https://github.com/mlfoundations/open_clip).
Uses
The weights of this model can be loaded with both the open_clip and the Hugging Face transformers libraries (transformers version 4.44.0 at the time of writing). This is a CLIP-based model, typically used for tasks such as zero-shot image classification and text-image retrieval.
Using open_clip
import torch
import requests
from PIL import Image
import open_clip

model_name = "romrawinjp/clip-kd_ViT-T-16_Baseline-CC3M12M"
model, preprocess = open_clip.create_model_from_pretrained('hf-hub:' + model_name)
tokenizer = open_clip.get_tokenizer('hf-hub:' + model_name)

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image = preprocess(image).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
Using the transformers library
import requests
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model_name = "romrawinjp/clip-kd_ViT-T-16_Baseline-CC3M12M"
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

text_labels = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]
inputs = processor(text=text_labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)      # convert logits to probabilities
print("Label probs:", probs)
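The transformers zero-shot-image-classification pipeline may also work with this checkpoint. This is a minimal sketch, assuming the checkpoint's processor configuration is picked up automatically by the pipeline; it is not part of the original instructions.

import requests
from PIL import Image
from transformers import pipeline

model_name = "romrawinjp/clip-kd_ViT-T-16_Baseline-CC3M12M"
classifier = pipeline("zero-shot-image-classification", model=model_name)

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Each result is a dict containing a candidate label and its score.
results = classifier(image, candidate_labels=["a photo of a cat", "a photo of a dog", "a photo of a bird"])
for r in results:
    print(f"{r['label']}: {r['score']:.3f}")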
Reference
Please refer to and cite the original work:
@inproceedings{yang2024clip,
  title={CLIP-KD: An Empirical Study of CLIP Model Distillation},
  author={Yang, Chuanguang and An, Zhulin and Huang, Libo and Bi, Junyu and Yu, Xinqiang and Yang, Han and Diao, Boyu and Xu, Yongjun},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}