Accurate image and video classification is important for a wide range of computer vision applications, from identifying harmful content, to making products more accessible to the visually impaired, to helping people more easily buy and sell things on products like Marketplace. Facebook AI is developing alternative ways to train our AI systems so that we can do more with less labeled training data overall, and also deliver accurate results even when large, high-quality labeled data sets are simply not available. Today, we are sharing details on a versatile new model training technique that delivers state-of-the-art accuracy for image and video classification systems.
This approach, which we call semi-weak supervision, is a new way to combine the merits of two different training methods: semi-supervised learning and weakly supervised learning. It opens the door to creating more accurate, efficient production classification models by using a teacher-student model training paradigm and billion-scale weakly supervised data sets. If the weakly supervised data sets (such as the hashtags associated with publicly available photos) are not available for the target classification task, our method can also make use of unlabeled data sets to produce highly accurate semi-supervised models.
Our semi-weakly supervised training framework has let us set a new state of the art on academic benchmarks for lightweight image and video classification models. We achieved 81.2 percent top-1 accuracy on ImageNet using the ResNet-50 model for our benchmarking tests. In the case of the Kinetics video action classification benchmark, we achieved 74.2 percent top-1 accuracy on the validation set with a low-capacity R(2+1)D-18 model. This is a 2.7 percent improvement over the previous state-of-the-art results obtained by the same-capacity weakly supervised R(2+1)D-18 model using the same input data sets and compute resources.
Semi-weakly supervised learning helps reduce the accuracy gap between the high-capacity state-of-the-art models and the computationally efficient production-grade models. Our approach is enabling Facebook to create efficient, low-capacity production-ready models that deliver substantially higher accuracy than was previously possible, which will improve products used by billions of people.
In training a target classification model using only labeled data, the accuracy of the target model is highly dependent on the scale and quality of the data set. But the human labeling of training data required for this fully supervised approach cannot scale to all the possible visual concepts in the world. Labeling thousands of species of plants and animals, for example, is resource intensive and requires extensive domain expertise.
In 2018, Facebook AI researchers demonstrated that we could use the hashtags associated with billions of publicly available Instagram photos to train highly accurate classification models. This approach identifies a set of related hashtags for the target classification task, uses associated images for pretraining, and then fine-tunes the target model with all the available labeled examples. This is a weakly supervised approach, since hashtagged data sets contain significant label noise — for example, tags such as “love” are used subjectively and idiosyncratically, and tags like “perseverance” refer to abstract concepts. But despite these challenges, we were able to train very large capacity weakly supervised models that delivered state-of-the-art accuracy. We open-sourced the classification models that produced these results on various benchmarks.
Although weak supervision has delivered noteworthy successes for well-known academic benchmarks, it has limitations. Hashtagged content is not always available for a particular classification task. On Facebook and Instagram, for example, a large amount of visual content doesn’t have any associated hashtag. And while publicly available unlabeled photos are extremely plentiful, weak supervision cannot use this data for pretraining models. Furthermore, the state-of-the-art weakly supervised classification models are of high capacity and computationally quite expensive. These constraints prompted our exploration of ways to make use of the immense amount of publicly available unlabeled data sets for building more accurate classification models.
Semi-supervised learning offers a different approach to decreasing AI systems’ dependence on labeled data sets. The method trains a target model using large amounts of unlabeled data in combination with a small set of labeled examples.
The first step is to train a larger-capacity and highly accurate “teacher” model with all available labeled data sets. The teacher model is then used to predict the labels and corresponding softmax scores for all the unlabeled examples. These examples are then ranked against each concept class. Top-scoring examples are used for pretraining the lightweight, computationally efficient “student” classification model. The final step is to fine-tune the student model with all the available labeled data. The target model thus learns from both its teacher and the unlabeled data sets at the pretraining stage.
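To make the ranking and selection step concrete, here is a minimal sketch in plain Python. It assumes the teacher's softmax scores are already available as a dictionary; the function name `select_top_k_per_class`, the example IDs, and the value of `k` are hypothetical and chosen only for illustration, not taken from our training pipeline.

```python
import heapq
from collections import defaultdict


def select_top_k_per_class(teacher_scores, k):
    """Rank unlabeled examples against each class and keep the k best.

    teacher_scores: dict mapping example_id -> list of softmax scores,
                    one score per class (hypothetical input format).
    Returns a dict mapping class index -> example ids, best first.
    """
    per_class = defaultdict(list)
    for example_id, probs in teacher_scores.items():
        for class_idx, score in enumerate(probs):
            per_class[class_idx].append((score, example_id))

    selected = {}
    for class_idx, scored in per_class.items():
        # nlargest returns the k highest-scoring (score, id) pairs.
        top = heapq.nlargest(k, scored)
        selected[class_idx] = [example_id for _, example_id in top]
    return selected


# Toy teacher output for four unlabeled images and two classes.
teacher_scores = {
    "img_a": [0.90, 0.10],
    "img_b": [0.60, 0.40],
    "img_c": [0.20, 0.80],
    "img_d": [0.55, 0.45],
}
pretraining_set = select_top_k_per_class(teacher_scores, k=2)
print(pretraining_set)
```

In this sketch, the class under which an example is selected serves as its label for student pretraining, and the same example may be selected for more than one class.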
This proposed model training framework produces models with higher accuracy compared with the fully supervised regime, in which the target model is trained only on labeled data.
While this high-level description outlines the basic principles for semi-supervised learning, we have found that many nuanced decisions affect the performance of semi-supervised frameworks in practice. Furthermore, semi-supervised training has not previously been explored at this scale (with billions of content examples) for image and video classification models evaluated on competitive academic benchmarks.
Our new semi-weak supervision approach aims to improve upon the aforementioned semi-supervised framework by leveraging very large sets of weakly supervised data. The weak supervision in the form of hashtags is used for creating a more focused and relevant unlabeled data set for semi-supervision. Moreover, the same filtered data set is used for training a very large capacity weakly supervised teacher model for selecting the pretraining samples for the student model.
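As a rough illustration of how hashtags can narrow a large pool of content down to a more focused data set, consider the sketch below. The data layout, the function name `filter_by_hashtags`, and the example tags are assumptions made for this example only; the actual hashtag selection and matching used in our experiments are not described here.

```python
def filter_by_hashtags(posts, relevant_hashtags):
    """Keep only posts that carry at least one task-relevant hashtag.

    posts: iterable of (post_id, list_of_hashtags) pairs (hypothetical format).
    relevant_hashtags: hashtags related to the target classification task.
    """
    relevant = {tag.lower() for tag in relevant_hashtags}
    kept = []
    for post_id, hashtags in posts:
        # Case-insensitive match between the post's tags and the relevant set.
        if relevant & {tag.lower() for tag in hashtags}:
            kept.append(post_id)
    return kept


posts = [
    ("photo_1", ["GoldenRetriever", "love"]),
    ("photo_2", ["perseverance"]),
    ("photo_3", []),  # no hashtag at all
    ("photo_4", ["tabbycat"]),
]
focused = filter_by_hashtags(posts, ["goldenretriever", "tabbycat"])
print(focused)
```

The filtered set can then be scored by the teacher model in the same way as an unlabeled data set.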
In order to evaluate the effectiveness of our framework, we used the publicly available ImageNet benchmark for photo classification and the Kinetics benchmark for video, in combination with several commonly used residual network models. In the case of ImageNet, the current state-of-the-art accuracy is obtained by Facebook AI’s weakly supervised ResNeXt-101-32x48 model. It delivers 85.4 percent top-1 accuracy, outperforming by a significant margin the recently published EfficientNet network. The model is pretrained on one billion public Instagram images, which contain 1,500 ImageNet-related hashtags. If we were to pretrain a target ResNet-50 ImageNet model using the same weakly supervised data set and approach, we would get 78.2 percent top-1 accuracy on the ImageNet benchmark. This mark serves as our baseline as we explore semi-weakly supervised alternatives.
In order to achieve the state of the art, our researchers used the weakly supervised ResNeXt-101-32x48 teacher model to select pretraining examples from the same data set of one billion hashtagged images. The target ResNet-50 model is pretrained with the selected examples and then fine-tuned with the ImageNet training data set. The resulting semi-weakly supervised ResNet-50 model achieves 81.2 percent top-1 accuracy. This is the current state of the art for the ResNet-50 ImageNet benchmark model. The top-1 accuracy is 3 percent higher than the (weakly supervised) ResNet-50 baseline, which is pretrained and fine-tuned on the same data sets with exactly the same hyperparameters.
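For readers less familiar with the metric, top-1 accuracy is the fraction of examples whose highest-scoring predicted class matches the true label. The short sketch below computes it on made-up scores and then reproduces the gap between the two ResNet-50 results quoted above; the function name `top1_accuracy` and the toy scores are illustrative assumptions.

```python
def top1_accuracy(score_rows, true_labels):
    """Fraction of examples whose arg-max class equals the true label."""
    correct = 0
    for scores, label in zip(score_rows, true_labels):
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += predicted == label
    return correct / len(true_labels)


# Toy scores for four examples over three classes (not real model output).
scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.5, 0.4],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
]
labels = [0, 2, 2, 0]
accuracy = top1_accuracy(scores, labels)

# Gap between the semi-weakly supervised ResNet-50 (81.2) and the
# weakly supervised ResNet-50 baseline (78.2), in accuracy points.
gain = round(81.2 - 78.2, 1)
print(accuracy, gain)
```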
Pretrained image classification models are widely used for various transfer learning tasks. We sought to explore how to fine-tune an existing classification model for use on a different classification task for which there are insufficient training examples available. The classification accuracy for the target task is highly dependent on the capacity, accuracy, and problem domain of the pretrained trunk model.
In this context, we evaluate the representation power of the semi-supervised and semi-weakly supervised ImageNet classification models. The last layer of the fully supervised ResNet-50 ImageNet model is fine-tuned for the CUB-2011 bird image classification task. The fine-tuned model provides 73.3 percent top-1 accuracy on the CUB-2011 benchmark, and our weakly supervised ResNet-50 ImageNet model achieves 74 percent top-1 accuracy.
Our highest-performing semi-weakly supervised ResNet-50 model improves upon this, however. It sets a new mark for accuracy on the CUB-2011 transfer learning task: 80.7 percent. This is a 6.7 percent improvement in accuracy with respect to the weakly supervised ResNet-50 model.
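Fine-tuning only the last layer amounts to training a linear classifier on top of features produced by the frozen pretrained trunk. The sketch below illustrates that idea with a small softmax classifier written in plain Python. The two-dimensional "features", the learning rate, and the epoch count are toy values assumed for illustration; this is not the training setup used for the CUB-2011 results.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_logits(weights, biases, x):
    return [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
            for w, b in zip(weights, biases)]


def train_last_layer(features, labels, num_classes, lr=0.5, epochs=200):
    """Train only a linear layer with SGD on cross-entropy loss.

    features are treated as fixed outputs of a frozen pretrained trunk.
    """
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax(linear_logits(weights, biases, x))
            for c in range(num_classes):
                # Gradient of cross-entropy w.r.t. logit c: p_c - 1[c == y].
                grad = probs[c] - (1.0 if c == y else 0.0)
                for j in range(dim):
                    weights[c][j] -= lr * grad * x[j]
                biases[c] -= lr * grad
    return weights, biases


def predict(weights, biases, x):
    logits = linear_logits(weights, biases, x)
    return max(range(len(logits)), key=logits.__getitem__)


# Toy "trunk features" for four images from two classes.
frozen_features = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.8]]
labels = [0, 0, 1, 1]
weights, biases = train_last_layer(frozen_features, labels, num_classes=2)
predictions = [predict(weights, biases, x) for x in frozen_features]
print(predictions)
```

Because the trunk stays fixed, the quality of its features largely determines how well such a classifier can do, which is why this setup is a useful probe of representation power.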
The teacher-student-based semi-supervised learning framework also generalizes to video classification tasks. Our evaluation uses the Kinetics-400 video action classification benchmark, and the state-of-the-art weakly supervised R(2+1)D-152 video classification model is used as the teacher. The teacher model predicts labels for the same weakly supervised data set of 65 million publicly available Instagram videos with which it is pretrained. All the videos against each action class are ranked globally. The top-scoring 4K video examples are selected as part of the data set employed for pretraining the lower-capacity R(2+1)D student model. Finally, in order to complete our benchmarking, the student model is fine-tuned with all the available labeled videos in the Kinetics-400 data set.
The weakly supervised teacher model, with 24x greater capacity than the student model, provides 82.8 percent top-1 accuracy on the validation set. For reference, training the student model with weak supervision provides 71.5 percent top-1 accuracy. Pretraining the student model with the examples from the same weakly supervised data set sampled by the state-of-the-art teacher model provides 74.2 percent top-1 accuracy. For the target R(2+1)D architecture, the semi-weakly supervised model provides a 9.4 percent accuracy improvement over the state of the art on the Kinetics-400 benchmark. It delivers a 2.7 percent accuracy improvement over the weakly supervised R(2+1)D models.
Facebook’s semi-supervised training framework improves the accuracy of lightweight production models by a large margin. More accurate lightweight classification models deployed to production will help us better understand the visual content and subsequently improve the user experience. Most important, more accurate models detect more bad content and keep Facebook safe.
We believe that learning from unlabeled data sets is the path forward for improving state-of-the-art classification models. Human annotation will continue to be resource intensive, difficult to scale, and sometimes simply unavailable. But ongoing hardware advances are making it easier to train on extremely large sets of photos or videos. Billion-scale unlabeled data sets will be an important tool for training highly accurate visual understanding models.
By developing training methods that do not rely solely on data that’s been labeled for training purposes by humans, we hope to develop systems that are more versatile and are able to generalize to unseen tasks — potentially bringing us closer to our goal of achieving AI with human-level intelligence.
Facebook AI is exploring self-supervised learning in a variety of other fields, including machine translation, where our model led the 2019 WMT international machine translation competition. We’ve achieved state-of-the-art results with RoBERTa, our optimized method for pretraining self-supervised NLP systems. We believe these efforts will help us further improve tools to keep people safe on our platforms, help people connect across different languages, and advance AI in new ways.
I. Zeki Yalniz
Research Scientist Manager