jienengchen/ViTamin-XL-384px
ViTamin: Designing Scalable Vision Models in the Vision-Language Era. The best-performing model is 'jienengchen/ViTamin-XL-384px'. A minimal usage sketch follows the notes below.
Note: ViTamin-XL, with only 436M parameters and trained on the public DataComp-1B dataset, achieves an impressive 82.9% 🔥 zero-shot ImageNet accuracy.
Note: ViTamin-L, with 333M parameters, sets a new SOTA 🔥 across seven open-vocabulary segmentation benchmarks and also significantly pushes forward the capabilities of large multi-modal models 🌋.
Note: achieves 70.8% zero-shot ImageNet accuracy with 88M parameters.
Note: achieves 63.4% zero-shot ImageNet accuracy with 22M parameters.
Note: achieves 68.9% zero-shot ImageNet accuracy with 88M parameters.
Note: achieves 62.2% zero-shot ImageNet accuracy with 22M parameters.
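The zero-shot ImageNet numbers above come from standard CLIP-style image–text matching: embed the image and a set of class prompts, then pick the prompt with the highest cosine similarity. Below is a minimal sketch of that procedure with the best-performing checkpoint. It assumes the repo's custom modeling code loads through AutoModel with trust_remote_code=True, that the repo ships CLIP-style processor and tokenizer configs, and that the forward pass returns (image_features, text_features, logit_scale) as in CLIP-style models; check the model card for the exact interface. The image path and prompt list are placeholders.

```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor, CLIPTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
repo = "jienengchen/ViTamin-XL-384px"

# Assumption: the repo ships custom modeling code, hence trust_remote_code=True.
model = AutoModel.from_pretrained(repo, trust_remote_code=True).to(device).eval()
image_processor = CLIPImageProcessor.from_pretrained(repo)
tokenizer = CLIPTokenizer.from_pretrained(repo)

# Placeholder image; the processor resizes to the model's 384px input.
image = Image.open("cat.jpg").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values.to(device)

prompts = ["a photo of a cat", "a photo of a dog", "a photo of a vitamin bottle"]
input_ids = tokenizer(prompts, padding=True, return_tensors="pt").input_ids.to(device)

with torch.no_grad():
    # Assumption: CLIP-style forward returning image/text features and a logit scale.
    image_features, text_features, logit_scale = model(pixel_values, input_ids)
    # Normalize so the dot product is cosine similarity, then score each prompt.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (logit_scale * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(prompts, probs[0].tolist())))
```

Scaling this to the full 1000-class ImageNet evaluation is the same loop with one prompt (or an ensemble of prompt templates) per class; the reported accuracies are obtained with that protocol.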