Update README.md
README.md CHANGED
@@ -17,15 +17,10 @@ January 2021
 
 ### Model Type
 
-The
+The model uses a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.
 
-
+The original implementation had two variants: one using a ResNet image encoder and the other using a Vision Transformer. This repository has the variant with the Vision Transformer.
 
-Initially, we’ve released one CLIP model based on the Vision Transformer architecture equivalent to ViT-B/32, along with the RN50 model, using the architecture equivalent to ResNet-50.
-
-*This port does not include the ResNet model.*
-
-Please see the paper linked below for further details about their specification.
 
 ### Documents
 
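For context, the contrastive objective named in the updated text can be illustrated with a short PyTorch sketch. This is not code from the repository or from the diff above; the function name, temperature value, and tensor shapes are illustrative assumptions, with random tensors standing in for the two encoders' outputs.

```python
# Minimal sketch of a CLIP-style contrastive loss: given a batch of matched
# (image, text) pairs, a symmetric cross-entropy over the pairwise
# cosine-similarity matrix pulls matched pairs together and pushes
# mismatched pairs apart. Names and the temperature are illustrative.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity logits: entry (i, j) compares image i with text j.
    logits = image_features @ text_features.t() / temperature

    # The matched text for image i sits at column i.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric loss: pick the right text for each image and vice versa.
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2

# Random stand-ins for encoder outputs: batch of 8 pairs, 512-dim features.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```

The symmetric formulation reflects how the batch supplies the negatives: every mismatched (image, text) pairing within the batch acts as a negative example for both directions of retrieval.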