arXiv:2104.14294

Emerging Properties in Self-Supervised Vision Transformers

Published on Apr 29, 2021

Abstract

In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets. Second, these features are also excellent k-NN classifiers, reaching 78.3% top-1 on ImageNet with a small ViT. Our study also underlines the importance of momentum encoder, multi-crop training, and the use of small patches with ViTs. We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels. We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
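
Since the abstract highlights that the frozen features are strong k-NN classifiers, here is a minimal sketch of a weighted k-NN evaluation on frozen CLS-token features, assuming PyTorch; the values of k and the temperature are illustrative defaults, not a guaranteed reproduction of the paper's exact protocol.

```python
# Illustrative weighted k-NN evaluation on frozen features (a common protocol;
# the paper's exact settings are described in its appendix).
import torch
import torch.nn.functional as F

def knn_classify(train_feats, train_labels, test_feats, num_classes, k=20, temp=0.07):
    """train_feats: [N, D] float, train_labels: [N] long, test_feats: [M, D] float."""
    train_feats = F.normalize(train_feats, dim=1)   # cosine similarity via dot product
    test_feats = F.normalize(test_feats, dim=1)
    sim = test_feats @ train_feats.T                # [M, N] similarities
    topk_sim, topk_idx = sim.topk(k, dim=1)         # k nearest training examples
    topk_labels = train_labels[topk_idx]            # [M, k]
    weights = (topk_sim / temp).exp()               # similarity-weighted votes
    votes = torch.zeros(test_feats.size(0), num_classes, device=test_feats.device)
    votes.scatter_add_(1, topk_labels, weights)
    return votes.argmax(dim=1)                      # predicted class per test image
```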

Community

Proposes DINO (self-distillation with no labels): a self-supervised pretraining method for ViTs whose features carry more explicit semantic-segmentation information than those of supervised ViTs or convnets. It relies on a momentum encoder (as in MoCo), multi-crop training, and small patch sizes. Visualizing the self-attention of the CLS token in the last block yields class-specific, unsupervised object segmentations.

Training: the student network has to match the output probability distribution of the teacher, through which no gradients flow, by minimizing cross-entropy. Multi-crop strategy: the teacher sees only two global crops (large, 224 px) while the student also sees many local crops (small, 96 px) and must match the teacher's output on them, encouraging local-to-global correspondence. The teacher is updated as an exponential moving average (EMA) of the student weights, which gives steadier and better performance. A 3-layer MLP projection head sits on top of the backbone features (the backbone features are what is used downstream). Centering and sharpening, applied to the teacher output only, avoid collapse; the center is itself an EMA of teacher outputs. The ViT implementation follows DeiT; data augmentations follow BYOL (color jittering, blur, solarization) plus flipping. A minimal sketch of this training step is given below.

Evaluation: linear and k-NN classification (on the CLS token) on ImageNet, outperforming other self-supervised methods (Barlow Twins, DeepCluster-v2, InfoMin, BYOL, etc.); nearest-neighbor image retrieval on Google Landmarks v2 (GLDv2) and copy detection; attention-map visualizations and segmentations. The appendix compares DINO to other SSL frameworks, includes network and projection-head ablations, and shows self-attention visualizations for different ImageNet classes. From Meta, Inria, and Sorbonne University.
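
A minimal sketch of the self-distillation step described above, in PyTorch. The crop ordering (the two global crops come first in the student's list), the momentum values, and the temperatures are illustrative assumptions, not the paper's exact schedule.

```python
# Minimal sketch of a DINO-style training step (illustrative, not the official code).
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """student_out: list of [B, K] logits, one per crop (2 global crops first, then locals).
    teacher_out: list of [B, K] logits for the 2 global crops only.
    center: running center of teacher outputs, shape [1, K]."""
    student_logp = [F.log_softmax(s / tau_s, dim=-1) for s in student_out]
    # teacher output is centered and sharpened (low temperature) and never backpropagated
    teacher_p = [F.softmax((t - center) / tau_t, dim=-1).detach() for t in teacher_out]

    loss, n_terms = 0.0, 0
    for ti, t in enumerate(teacher_p):
        for si, s in enumerate(student_logp):
            if si == ti:  # skip pairs where student and teacher see the same global crop
                continue
            loss = loss + (-(t * s).sum(dim=-1)).mean()  # cross-entropy H(teacher, student)
            n_terms += 1
    return loss / n_terms

@torch.no_grad()
def ema_update_teacher(student, teacher, momentum=0.996):
    # teacher weights are an exponential moving average of the student weights
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s.detach(), alpha=1 - momentum)

@torch.no_grad()
def ema_update_center(center, teacher_out, momentum=0.9):
    # the center is an EMA of the teacher's batch-mean output (used against collapse)
    batch_center = torch.cat(teacher_out).mean(dim=0, keepdim=True)
    return center * momentum + batch_center * (1 - momentum)
```

In a training loop one would compute student_out on all crops, teacher_out on the two global crops under torch.no_grad(), backpropagate dino_loss through the student only, then apply the two EMA updates.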

Links: Website, PapersWithCode, GitHub

Models citing this paper 27

Datasets citing this paper 0

Spaces citing this paper 121

Collections including this paper 3