Vision Transformer

Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. The weights were converted from the ViT-L_16.npz file hosted in the GCS buckets referenced in the original repository.
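
A minimal feature-extraction sketch using the Transformers library. The checkpoint id google/vit-large-patch16-224-in21k is an assumption inferred from the ViT-L/16, ImageNet-21k, 224x224 description above; it is not stated in this card.

```python
from transformers import ViTImageProcessor, ViTModel
from PIL import Image
import requests

# Load an example image (any RGB image works)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Assumed checkpoint id, inferred from the ViT-L/16 @ 224x224,
# ImageNet-21k description above
checkpoint = "google/vit-large-patch16-224-in21k"
processor = ViTImageProcessor.from_pretrained(checkpoint)
model = ViTModel.from_pretrained(checkpoint)

# Preprocess (resize to 224x224, normalize) and run the encoder
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)

# One 1024-dim embedding per token: 196 patch tokens (a 14x14 grid of
# 16x16 patches) plus a [CLS] token -> shape (1, 197, 1024)
features = outputs.last_hidden_state
print(features.shape)
```

Since the checkpoint is pre-trained without a fine-tuned head, the hidden states (or the [CLS] token embedding) are typically used as image features for a downstream classifier.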

Model size: 327M params (tensor type F32, Safetensors)