Update README.md
README.md CHANGED
@@ -17,15 +17,10 @@ January 2021
 
 ### Model Type
 
-The
+The model uses a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.
 
-
+The original implementation had two variants: one using a ResNet image encoder and the other using a Vision Transformer. This repository has the variant with the Vision Transformer.
 
-Initially, we’ve released one CLIP model based on the Vision Transformer architecture equivalent to ViT-B/32, along with the RN50 model, using the architecture equivalent to ResNet-50.
-
-*This port does not include the ResNet model.*
-
-Please see the paper linked below for further details about their specification.
 
 ### Documents
 
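For context, the contrastive objective named in the updated text can be illustrated with a short PyTorch sketch. This is not code from the repository or from the diff above; the function name, temperature value, and tensor shapes are illustrative assumptions, with random tensors standing in for the two encoders' outputs.

```python
# Minimal sketch of a CLIP-style contrastive loss: given a batch of matched
# (image, text) pairs, a symmetric cross-entropy over the pairwise
# cosine-similarity matrix pulls matched pairs together and pushes
# mismatched pairs apart. Names and the temperature are illustrative.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity logits: entry (i, j) compares image i with text j.
    logits = image_features @ text_features.t() / temperature

    # The matched text for image i sits at column i.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric loss: pick the right text for each image and vice versa.
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2

# Random stand-ins for encoder outputs: batch of 8 pairs, 512-dim features.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```

The symmetric formulation reflects how the batch supplies the negatives: every mismatched (image, text) pairing within the batch acts as a negative example for both directions of retrieval.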