ruclip-vit-large-patch14-336

RuCLIP (Russian Contrastive Language–Image Pretraining) is a multimodal model that computes similarity between images and texts and ranks captions and pictures against each other. RuCLIP builds on a large body of work on zero-shot transfer, computer vision, natural language processing, and multimodal learning.

The model was trained by the Sber AI and SberDevices teams.

  • Task: text ranking; image ranking; zero-shot image classification
  • Type: encoder
  • Num Parameters: 430M
  • Training Data Volume: 240 million text-image pairs
  • Language: Russian
  • Context Length: 77
  • Transformer Layers: 12
  • Transformer Width: 768
  • Transformer Heads: 12
  • Image Size: 336
  • Vision Layers: 24
  • Vision Width: 1024
  • Vision Patch Size: 14
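
The image size and patch size above determine how many patches the vision encoder processes per image. The arithmetic below is a small sketch of that relationship. It assumes square, non-overlapping patches, one extra class token (as in the original CLIP ViT), and attention width split evenly across heads; none of these details are stated in this card.

```python
# Values copied from the list above.
image_size = 336
patch_size = 14
text_width = 768
text_heads = 12

# Assumption: square, non-overlapping patches, as in a standard ViT.
patches_per_side = image_size // patch_size
num_patches = patches_per_side ** 2

# Assumption: one class token is prepended to the patch sequence.
vision_sequence_length = num_patches + 1

# Assumption: the text transformer width is split evenly across heads.
text_head_dim = text_width // text_heads

print(patches_per_side)        # 24
print(num_patches)             # 576
print(vision_sequence_length)  # 577
print(text_head_dim)           # 64
```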

Usage (GitHub)

pip install ruclip

import ruclip
# Load the model and its matching processor onto the GPU.
clip, processor = ruclip.load("ruclip-vit-large-patch14-336", device="cuda")
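
The ranking tasks listed above come down to comparing image and text embeddings. The sketch below illustrates the idea with cosine similarity, which is the usual choice for CLIP-style models. It assumes the embeddings are already available as NumPy arrays and does not call the ruclip API; the function name `rank_by_similarity` and the toy vectors are hypothetical.

```python
import numpy as np

def rank_by_similarity(image_embedding, text_embeddings):
    """Return caption indices ordered from most to least similar, plus the scores."""
    # L2-normalize so that the dot product equals cosine similarity.
    image = image_embedding / np.linalg.norm(image_embedding)
    texts = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    scores = texts @ image       # one cosine similarity per caption
    order = np.argsort(-scores)  # indices in descending score order
    return order, scores

# Toy stand-in embeddings (hypothetical values, not produced by RuCLIP).
image_embedding = np.array([1.0, 0.0, 0.0])
text_embeddings = np.array([
    [0.0, 1.0, 0.0],  # caption 0: orthogonal to the image
    [2.0, 0.1, 0.0],  # caption 1: nearly parallel to the image
    [1.0, 1.0, 0.0],  # caption 2: at 45 degrees to the image
])

order, scores = rank_by_similarity(image_embedding, text_embeddings)
print(order.tolist())  # [1, 2, 0]
```

For zero-shot classification the same ranking applies, with one caption per candidate class: the top-ranked caption gives the predicted class.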

Performance

We evaluated the model on the following datasets:

| Dataset       | Metric Name    | Metric Result |
|---------------|----------------|---------------|
| Food101       | acc            | 0.712         |
| CIFAR10       | acc            | 0.906         |
| CIFAR100      | acc            | 0.591         |
| Birdsnap      | acc            | 0.213         |
| SUN397        | acc            | 0.523         |
| Stanford Cars | acc            | 0.659         |
| DTD           | acc            | 0.408         |
| MNIST         | acc            | 0.242         |
| STL10         | acc            | 0.956         |
| PCam          | acc            | 0.554         |
| CLEVR         | acc            | 0.142         |
| Rendered SST2 | acc            | 0.539         |
| ImageNet      | acc            | 0.488         |
| FGVC Aircraft | mean-per-class | 0.075         |
| Oxford Pets   | mean-per-class | 0.546         |
| Caltech101    | mean-per-class | 0.835         |
| Flowers102    | mean-per-class | 0.517         |
| HatefulMemes  | roc-auc        | 0.519         |
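
The card does not define the metric names. Assuming "acc" means overall top-1 accuracy and "mean-per-class" means accuracy averaged over classes, the sketch below shows how the two can differ on an imbalanced dataset. The function names and the toy labels are hypothetical and are not taken from the evaluation code behind these results.

```python
from collections import defaultdict

def accuracy(y_true, y_pred):
    """Fraction of all samples that are predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def mean_per_class_accuracy(y_true, y_pred):
    """Accuracy computed separately for each true class, then averaged."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Hypothetical imbalanced toy data: 8 cats, 2 dogs, one dog misclassified.
y_true = ["cat"] * 8 + ["dog"] * 2
y_pred = ["cat"] * 8 + ["cat", "dog"]

print(accuracy(y_true, y_pred))                 # 0.9
print(mean_per_class_accuracy(y_true, y_pred))  # 0.75
```

Under these assumptions, the per-class average gives each class equal weight regardless of how many samples it has.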

Authors