July 20, 2020
We’ve developed a new technique for self-supervised training of convolutional networks commonly used for image classification and other computer vision tasks. Our method now surpasses supervised approaches on most transfer tasks, and, when compared with previous self-supervised methods, models can be trained much more quickly to achieve high performance. For instance, our technique requires only 6 hours and 15 minutes to achieve 72.1 percent top-1 accuracy with a standard ResNet-50 on ImageNet, using 64 V100 16GB GPUs. Previous self-supervised methods required at least 6x more computing power and still achieved worse performance.
Training convolutional networks typically requires a large amount of annotated data, which has limited their applications in fields where accessing annotations is difficult. Recent improvements in self-supervised training methods — in particular, the introduction of contrastive approaches such as MoCo and PIRL — have made them a serious alternative to traditional supervised training. But these approaches have still lagged in performance and are significantly slower to train, often requiring 100x more computing power than their supervised counterparts. Our method leverages contrastive learning in a much more effective and efficient manner.
Contrastive learning is a powerful method to learn visual features without supervision. Instead of predicting a label associated with an image, contrastive methods train convolutional networks by discriminating between images. For example, the system might compare three images: a black-and-white photo of a cat, a color photo of the same cat, and a color drawing of a mountain. It could then learn that despite their visual differences, the first two have very similar semantic content, while the third does not. By leveraging the information that makes two images visually different, contrastive learning can discover semantics present in the images.
In practice, contrastive learning takes two different transformations of the cat image mentioned above and pushes them closer together than transformations of other images, such as the drawing of a mountain. This way the model learns that the notion of a cat is invariant to image transformations that affect, for example, its color. This approach works well but requires the system to transform the same image in many different ways and to compare every possible pair of transformed images individually, which is extremely computation-intensive.
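To make the pairwise comparison concrete, here is a minimal sketch of a contrastive objective. It assumes an InfoNCE-style loss with a temperature of 0.1, which is a common choice but is not named in this post, and the three-dimensional feature vectors are hypothetical toy values rather than real network outputs.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (an assumed choice; the post does not name a loss).

    The loss is low when `anchor` is more similar to `positive`
    than to every vector in `negatives`.
    """
    anchor = normalize(anchor)
    candidates = [normalize(positive)] + [normalize(n) for n in negatives]
    logits = [dot(anchor, c) / temperature for c in candidates]
    # Numerically stable negative log-softmax of the positive (index 0).
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

# Hypothetical toy features: two views of a cat and one mountain drawing.
cat_bw = [0.9, 0.1, 0.0]
cat_color = [0.8, 0.2, 0.1]
mountain = [0.0, 0.1, 0.9]

good = contrastive_loss(cat_bw, cat_color, [mountain])  # cat views paired
bad = contrastive_loss(cat_bw, mountain, [cat_color])   # wrong pairing
print(good < bad)
```

Every anchor has to be scored against every other candidate, so the number of comparisons grows quadratically with the number of transformed images being compared.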
In this work, we propose an alternative that does not require an explicit comparison between every image pair. We first compute features of cropped sections of two transformed versions of an image and assign each of them to a cluster of images. These assignments are made independently and may not match; for example, the black-and-white version of the cat image could match a cluster that contains some cat images, while its color version could match a cluster that contains different cat images. We constrain the two cluster assignments to match over time, so the system will eventually discover that all the images of cats represent the same information. This is done by contrasting the cluster assignments, i.e., predicting the cluster of one version of the image from the other version.
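The idea of predicting one version's cluster from the other version can be sketched as follows. This is a simplified illustration under stated assumptions, not the released implementation: the cluster centers (`prototypes`), the softmax-based assignment, the temperature values, and the toy features are all hypothetical. A real system also needs a mechanism that keeps all images from collapsing into a single cluster, which this sketch omits.

```python
import math

def softmax(scores, temperature):
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cluster_scores(feature, prototypes):
    """Dot product between one feature vector and each cluster prototype."""
    return [sum(f * p for f, p in zip(feature, proto)) for proto in prototypes]

def cross_entropy(target, predicted):
    return -sum(t * math.log(p) for t, p in zip(target, predicted))

def swapped_prediction_loss(feat_a, feat_b, prototypes, temperature=0.1):
    """Predict the cluster assignment of one view from the other view."""
    scores_a = cluster_scores(feat_a, prototypes)
    scores_b = cluster_scores(feat_b, prototypes)
    # Cluster assignments, used as fixed targets. Computing them with a
    # sharper softmax is an assumption made to keep this sketch short.
    assign_a = softmax(scores_a, temperature / 2)
    assign_b = softmax(scores_b, temperature / 2)
    # Predictions each view makes about cluster membership.
    pred_a = softmax(scores_a, temperature)
    pred_b = softmax(scores_b, temperature)
    # The swap: view a predicts b's assignment, and view b predicts a's.
    return cross_entropy(assign_b, pred_a) + cross_entropy(assign_a, pred_b)

# Hypothetical toy setup: two clusters and 2-D features.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
cat_bw = [0.9, 0.1]
cat_color = [0.8, 0.3]
mountain = [0.1, 0.9]

matching = swapped_prediction_loss(cat_bw, cat_color, prototypes)
mismatched = swapped_prediction_loss(cat_bw, mountain, prototypes)
print(matching < mismatched)
```

Each view is compared only with the small set of prototypes rather than with every other image, which is what removes the need for explicit pairwise comparisons.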
In addition, we introduce a multicrop data augmentation for self-supervised learning that allows us to greatly increase the number of image comparisons made during training without having much of an impact on the memory or compute requirements. We simply replace the two full-size images by a mix of crops with different resolutions. We find that this simple transformation works across many self-supervised methods.
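A rough back-of-the-envelope sketch shows why mixing crop resolutions is cheap. The crop sizes below (224, 160, and 96 pixels) and the crop counts are hypothetical illustrative values, and the sketch assumes that compute and memory scale roughly with the total number of input pixels, which is only an approximation for convolutional networks.

```python
def crop_plan_stats(crop_sizes):
    """Return (number of distinct crop pairs, total input pixels).

    `crop_sizes` lists the side length of each square crop taken from one image.
    """
    n = len(crop_sizes)
    pairs = n * (n - 1) // 2
    pixels = sum(side * side for side in crop_sizes)
    return pairs, pixels

# Baseline: two full-size views per image (hypothetical 224x224 crops).
base_pairs, base_pixels = crop_plan_stats([224, 224])

# Multicrop: two medium crops plus four small crops (hypothetical sizes).
mc_pairs, mc_pixels = crop_plan_stats([160, 160, 96, 96, 96, 96])

print(base_pairs, mc_pairs)               # 1 15
print(round(mc_pixels / base_pixels, 2))  # 0.88
```

In this toy configuration, the number of crop pairs available for comparison rises from 1 to 15 while the total pixel count stays slightly below the two-crop baseline.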
Our approach allows researchers to train efficient, high-performance image classification models with no annotations or metadata. Since there are many domains where annotations are difficult or even impossible to collect, this work can help with many downstream applications. For example, removing the need for annotations benefits applications in which annotation requires expert knowledge, such as medical imaging, or is time-consuming, such as fine-grained classification.
More broadly, we believe that self-supervised learning is key to building more flexible and useful AI. People learn skills such as speaking a language and recognizing objects without needing large-scale labeled datasets. Likewise, self-supervised training will enable AI to learn directly from the vast amount of information available in the world, rather than just from training data created specifically for AI research. Facebook AI is pursuing a wide range of research projects on self-supervised learning, including a new framework for video understanding, cross-lingual understanding, and 3D content understanding. By making it easier to train self-supervised image classification systems, the work discussed here, along with future improvements and additional resources that we are working on now, will help us advance toward that long-term goal.