Vision Transformer
Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at a resolution of 224x224. It was introduced in the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale".
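To make the "16x16 words" idea concrete, here is a minimal sketch of the patch arithmetic for a 224x224 input. It assumes non-overlapping 16x16 patches (the patch size implied by the paper title and by the `ViT-L_16` checkpoint name) and one extra classification token prepended to the sequence. The function name `num_patches` is illustrative and not part of any library.

```python
def num_patches(image_size: int = 224, patch_size: int = 16) -> int:
    """Count the non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size  # patches along one edge
    return per_side * per_side


patches = num_patches()          # 14 * 14 patches for a 224x224 image
sequence_length = patches + 1    # assumption: one classification token is prepended
print(patches, sequence_length)  # prints "196 197"
```

Under these assumptions, each image becomes a sequence of 196 patch embeddings plus one classification token.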
The weights were converted from the ViT-L_16.npz file in the GCS buckets referenced in the original repository.
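Before converting weights, it can help to check what an `.npz` checkpoint contains. The sketch below assumes NumPy is installed and lists the name and shape of every array in an archive. The helper `summarize_npz`, the file `demo_checkpoint.npz`, and the array names in it are made up so that the example runs without downloading anything; they do not reflect the actual contents of ViT-L_16.npz.

```python
import os
import tempfile

import numpy as np


def summarize_npz(path):
    """Return {array_name: shape} for every array stored in an .npz archive."""
    with np.load(path) as archive:
        return {name: archive[name].shape for name in archive.files}


# Stand-in archive with hypothetical array names, used only so the sketch
# runs without the real checkpoint.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo_checkpoint.npz")
np.savez(demo_path, demo_kernel=np.zeros((16, 16, 3, 8)), demo_bias=np.zeros(8))

for name, shape in summarize_npz(demo_path).items():
    print(name, shape)
```

Pointing `summarize_npz` at a locally downloaded copy of the checkpoint would list its parameter names and shapes in the same way.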