Billion-scale semi-supervised learning for state-of-the-art image and video classification

October 18, 2019

Accurate image and video classification is important for a wide range of computer vision applications, from identifying harmful content, to making products more accessible to the visually impaired, to helping people more easily buy and sell things on products like Marketplace. Facebook AI is developing alternative ways to train our AI systems so that we can do more with less labeled training data overall, and also deliver accurate results even when large, high-quality labeled datasets are simply not available. Today, we are sharing details on a versatile new model training technique that delivers state-of-the-art accuracy for image and video classification systems.

This approach, which we call semi-weak supervision, is a new way to combine the merits of two different training methods: semi-supervised learning and weakly supervised learning. It opens the door to creating more accurate, efficient production classification models by using a teacher-student model training paradigm and billion-scale weakly supervised datasets. If weakly supervised datasets (such as the hashtags associated with publicly available photos) are not available for the target classification task, our method can also make use of unlabeled datasets to produce highly accurate semi-supervised models.

Our semi-weakly supervised training framework has let us set a new state of the art on academic benchmarks for lightweight image and video classification models. We achieved 81.2 percent top-1 accuracy on ImageNet using the ResNet-50 model for our benchmarking tests. On the Kinetics video action classification benchmark, we achieved 74.2 percent top-1 accuracy on the validation set with a low-capacity R(2+1)D-18 model. This is a 2.7 percent improvement over the previous state-of-the-art result, obtained by a weakly supervised R(2+1)D-18 model of the same capacity using the same input datasets and compute resources.

Semi-weakly supervised learning helps reduce the accuracy gap between the high-capacity state-of-the-art models and the computationally efficient production-grade models. Our approach is enabling Facebook to create efficient, low-capacity production-ready models that deliver substantially higher accuracy than was previously possible, which will improve products used by billions of people.

Moving beyond labeled datasets

In training a target classification model using only labeled data, the accuracy of the target model is highly dependent on the scale and quality of the dataset. But the human labeling of training data required for this fully supervised approach cannot scale to all the possible visual concepts in the world. Labeling thousands of species of plants and animals, for example, is resource intensive and requires extensive domain expertise.

In 2018, Facebook AI researchers demonstrated that we could use the hashtags associated with billions of publicly available Instagram photos to train highly accurate classification models. This approach identifies a set of related hashtags for the target classification task, uses associated images for pretraining, and then fine-tunes the target model with all the available labeled examples. This is a weakly supervised approach, since hashtagged datasets contain significant label noise — for example, tags such as “love” are used subjectively and idiosyncratically, and tags like “perseverance” refer to abstract concepts. But despite these challenges, we were able to train very large capacity weakly supervised models that delivered state-of-the-art accuracy. We open-sourced the classification models that produced these results on various benchmarks.

Although weak supervision has delivered noteworthy successes on well-known academic benchmarks, it has limitations. Hashtagged content is not always available for a particular classification task. On Facebook and Instagram, for example, a large amount of visual content doesn’t have any associated hashtag. And while publicly available unlabeled photos are extremely plentiful, weak supervision cannot use this data to pretrain models. Furthermore, the state-of-the-art weakly supervised classification models are high capacity and computationally quite expensive. These constraints prompted us to explore ways to use the immense amounts of publicly available unlabeled data to build more accurate classification models.

Facebook’s semi-supervised training framework

This figure shows our semi-supervised training framework, which employs best practices to generate lightweight image and video classification models.

Semi-supervised learning offers a different approach to decreasing AI systems’ dependence on labeled datasets. The method trains a target model using large amounts of unlabeled data in combination with a small set of labeled examples.

The first step is to train a large-capacity, highly accurate “teacher” model with all available labeled datasets. The teacher model predicts the labels and corresponding softmax scores for all the unlabeled examples, which are then ranked against each concept class. The top-scoring examples are used for pretraining the lightweight, computationally highly efficient “student” classification model. The final step is to fine-tune the student model with all the available labeled data. The target model thus learns both from its teacher and from the unlabeled datasets during the pretraining stage.
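The ranking and selection step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not Facebook's actual implementation; the toy scores and the value of k are invented for the example:

```python
import numpy as np

def select_pretraining_examples(teacher_scores, k):
    """For each concept class, rank the unlabeled examples by the
    teacher's softmax score and return the indices of the top-k
    examples to use for pretraining the student model.

    teacher_scores: array of shape (num_unlabeled, num_classes)
    holding the teacher's softmax outputs.
    """
    selected = {}
    for c in range(teacher_scores.shape[1]):
        # Sort descending by the teacher's confidence for class c.
        ranked = np.argsort(-teacher_scores[:, c])
        selected[c] = ranked[:k].tolist()
    return selected

# Toy example: 6 unlabeled examples scored against 2 concept classes.
scores = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.7, 0.3],
    [0.1, 0.9],
    [0.6, 0.4],
    [0.3, 0.7],
])
top2 = select_pretraining_examples(scores, k=2)
# Class 0's most confident examples are indices 0 and 2;
# class 1's are indices 3 and 1.
```

The per-class top-k selected examples then form the pretraining set for the student, which is subsequently fine-tuned on the labeled data.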

This proposed model training framework produces models with higher accuracy compared with the fully supervised regime, in which the target model is trained only on labeled data.

While this high-level description outlines the basic principles for semi-supervised learning, we have found that many nuanced decisions affect the performance of semi-supervised frameworks in practice. Furthermore, semi-supervised training has not previously been explored at this scale (with billions of content examples) for image and video classification models evaluated on competitive academic benchmarks.

Achieving state-of-the-art results with semi-weak supervision

Our new semi-weak supervision approach aims to improve upon the aforementioned semi-supervised framework by leveraging very large sets of weakly supervised data. The weak supervision, in the form of hashtags, is used to create a more focused and relevant unlabeled dataset for semi-supervision. Moreover, the same filtered dataset is used to train a very large capacity weakly supervised teacher model, which selects the pretraining samples for the student model.
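The hashtag-based filtering step can be sketched as a simple vocabulary match. This is a hedged illustration: the concept vocabulary and post structure below are hypothetical, standing in for the hashtag-to-concept mapping the post describes:

```python
# Hypothetical vocabulary of hashtags related to the target
# classification task (the real system maps 1,500 hashtags to
# ImageNet-related concepts).
TARGET_HASHTAGS = {"#goldenretriever", "#tabbycat", "#parrot"}

def filter_by_hashtags(posts, vocabulary):
    """Keep only posts whose hashtags overlap the target concept
    vocabulary, producing a smaller, more relevant unlabeled pool
    for semi-supervised pretraining."""
    return [p for p in posts if vocabulary & set(p["hashtags"])]

posts = [
    {"id": 1, "hashtags": ["#goldenretriever", "#love"]},
    {"id": 2, "hashtags": ["#sunset"]},
    {"id": 3, "hashtags": ["#tabbycat"]},
]
relevant = filter_by_hashtags(posts, TARGET_HASHTAGS)
# Posts 1 and 3 match the target concept vocabulary; post 2 is dropped.
```

Note that the weak labels are used only to focus the unlabeled pool; the teacher model's predictions, not the hashtags themselves, drive the final example selection.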

The framework improves upon the semi-supervised framework above by leveraging weakly supervised datasets (if available) for both training the teacher and student models.

In order to evaluate the effectiveness of our framework, we used the publicly available ImageNet benchmark for photo classification and the Kinetics benchmark for video, in combination with several commonly used residual network models. On ImageNet, the current state-of-the-art accuracy is obtained by Facebook AI’s weakly supervised ResNeXt-101-32x48 model. It delivers 85.4 percent top-1 accuracy, outperforming the recently published EfficientNet network by a significant margin. The model is pretrained on one billion public Instagram images tagged with 1,500 ImageNet-related hashtags. If we were to pretrain a target ResNet-50 ImageNet model using the same weakly supervised dataset and approach, we would get 78.2 percent top-1 accuracy on the ImageNet benchmark. This mark serves as our baseline as we explore semi-weakly supervised alternatives.

In order to achieve the state of the art, our researchers used the weakly supervised ResNeXt-101-32x48 teacher model to select pretraining examples from the same dataset of one billion hashtagged images. The target ResNet-50 model is pretrained with the selected examples and then fine-tuned with the ImageNet training dataset. The resulting semi-weakly supervised ResNet-50 model achieves 81.2 percent top-1 accuracy, the current state of the art for a ResNet-50 model on the ImageNet benchmark. This top-1 accuracy is 3 percent higher than that of the weakly supervised ResNet-50 baseline, which is pretrained and fine-tuned on the same datasets with exactly the same hyperparameters.

In this chart, 85.4 percent represents the accuracy achieved by the state-of-the-art high-capacity ResNeXt-101-32x48 model. When trained with our semi-weakly supervised method, the lower-capacity ResNet-50 model achieves 81.2 percent accuracy, greatly reducing the gap between low-capacity and high-capacity models.

Representation power analysis

Pretrained image classification models are widely used for various transfer learning tasks. We sought to explore how to fine-tune an existing classification model for use on a different classification task for which there are insufficient training examples available. The classification accuracy for the target task is highly dependent on the capacity, accuracy, and problem domain of the pretrained trunk model.

In this context, we evaluate the representation power of the semi-supervised and semi-weakly supervised ImageNet classification models. The last layer of the fully supervised ResNet-50 ImageNet model is fine-tuned for the CUB-2011 bird image classification task. The fine-tuned model provides 73.3 percent top-1 accuracy on the CUB-2011 benchmark, and our weakly supervised ResNet-50 ImageNet model achieves 74 percent top-1 accuracy.
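Fine-tuning only the last layer amounts to training a linear classifier on frozen trunk features. The sketch below illustrates that idea with synthetic features standing in for the frozen ResNet-50 trunk; the dimensions, learning rate, and data are invented for the example and are not from the actual experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for frozen trunk features: two linearly
# separable classes in a 4-dimensional feature space.
n, d = 200, 4
X = rng.normal(size=(n, d))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# New "last layer": a logistic-regression head trained by gradient
# descent while the (simulated) trunk stays frozen.
w = np.zeros(d)
b = 0.0
lr = 0.5
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
    w -= lr * (X.T @ (p - y)) / n           # gradient of log loss w.r.t. w
    b -= lr * np.mean(p - y)                # gradient w.r.t. bias

accuracy = np.mean(((X @ w + b) > 0) == (y > 0.5))
```

Because only the linear head is trained, the transfer task's accuracy depends almost entirely on how discriminative the frozen features are, which is what the CUB-2011 comparison measures.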

Our highest-performing semi-weakly supervised ResNet-50 model improves upon this, however, setting a new transfer learning accuracy mark of 80.7 percent on the CUB-2011 task. This is a 6.7 percent improvement in accuracy with respect to the weakly supervised ResNet-50 model.

Semi-weak supervision for video classification

In this chart, 82.8 percent represents the top-1 accuracy achieved by the state-of-the-art high-capacity R(2+1)D-152 model. When trained with our semi-weakly supervised method, the lower-capacity R(2+1)D-18 model achieves 74.2 percent accuracy, greatly reducing the gap between the low-capacity student and the high-capacity teacher model.

The teacher-student-based semi-supervised learning framework also generalizes to video classification tasks. Our evaluation uses the Kinetics-400 video action classification benchmark, with the state-of-the-art weakly supervised R(2+1)D-152 video classification model as the teacher. The teacher model predicts labels for the same weakly supervised dataset of 65 million publicly available Instagram videos on which it was pretrained. For each action class, all the videos are ranked globally, and the top-scoring 4K examples are selected for the dataset used to pretrain the lower-capacity R(2+1)D student model. Finally, to complete our benchmarking, the student model is fine-tuned with all the available labeled videos in the Kinetics-400 dataset.

The weakly supervised teacher model, with 24x greater capacity than the student model, provides 82.8 percent top-1 accuracy on the validation set. For reference, training the student model with weak supervision provides 71.5 percent top-1 accuracy. Pretraining the student model with the examples from the same weakly supervised dataset sampled by the state-of-the-art teacher model provides 74.2 percent top-1 accuracy. For the target R(2+1)D architecture, the semi-weakly supervised model provides a 9.4 percent accuracy improvement over the state of the art on the Kinetics-400 benchmark. It delivers a 2.7 percent accuracy improvement over the weakly supervised R(2+1)D models.

The future of visual content understanding

Facebook’s semi-supervised training framework improves the accuracy of lightweight production models by a large margin. More accurate lightweight classification models deployed to production will help us better understand visual content and subsequently improve the user experience. Most important, more accurate models detect more harmful content, helping keep Facebook safe.

We believe that learning from unlabeled datasets is the path forward for improving state-of-the-art classification models. Human annotation will continue to be resource intensive, difficult to scale, and sometimes simply unavailable. But ongoing hardware advances are making it easier to train on extremely large sets of photos or videos. Billion-scale unlabeled datasets will be an important tool for training highly accurate visual understanding models.

By developing training methods that do not rely solely on data that’s been labeled for training purposes by humans, we hope to develop systems that are more versatile and are able to generalize to unseen tasks — potentially bringing us closer to our goal of achieving AI with human-level intelligence.

Facebook AI is exploring self-supervised learning in a variety of other fields, including machine translation, where our model led the 2019 WMT international machine translation competition. We’ve achieved state-of-the-art results with RoBERTa, our optimized method for pretraining self-supervised NLP systems. We believe these efforts will help us further improve tools to keep people safe on our platforms, help people connect across different languages, and advance AI in new ways.