July 8, 2021
We’ve developed a new computer vision model called ConViT, which combines two widely used AI architectures — convolutional neural networks (CNNs) and Transformer-based models — in order to overcome some important limitations of each approach on its own. By leveraging both techniques, this vision Transformer-based model can outperform existing architectures, especially in the low data regime, while achieving similar performance in the large data setting.
AI researchers often use a particular set of assumptions, commonly known as inductive biases, when building new machine learning models and training paradigms, because these assumptions can help models learn more generalizable solutions from less data. CNNs, which have proved extremely successful for vision tasks, rely on two of these inductive biases built into the architecture itself: that pixels near one another are related (locality) and that different portions of an image should be processed identically regardless of their absolute location (weight sharing).
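To make these two biases concrete, here is a minimal sketch in plain Python using a one-dimensional signal rather than an image. The kernel values and the toy signal are made up for illustration and are not taken from any real model. Each output depends only on a small window of neighbors (locality), and the same kernel is reused at every position (weight sharing), so shifting the input simply shifts the output.

```python
def conv1d(signal, kernel):
    """Valid 1D cross-correlation: one small kernel slides over the signal."""
    k = len(kernel)
    return [
        # Locality: each output only looks at k neighboring inputs.
        # Weight sharing: the same kernel is applied at every position i.
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


kernel = [1.0, 0.0, -1.0]  # hypothetical edge-detecting filter
signal = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
shifted = [0.0] + signal[:-1]  # the same pattern moved one step to the right

out = conv1d(signal, kernel)
out_shifted = conv1d(shifted, kernel)

# Away from the border, the shifted input gives the same response one step later.
print(out_shifted[1:] == out[:-1])
```

A plain self-attention layer has neither property built in: every position can attend to every other position, and nothing forces the pattern to be the same across locations.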
In contrast, self-attention-based vision models (like Data-efficient image Transformers and Detection Transformers) feature minimal inductive biases. When trained on large data sets, such models have matched and sometimes exceeded the performance of CNNs. But they often struggle to learn meaningful representations when trained on small data sets.
AI researchers are therefore presented with a trade-off: CNNs’ strong inductive biases enable them to reach high performance even with minimal data (high floor), yet these same inductive biases may limit these models when large quantities of data are present (low ceiling). In contrast, Transformers feature minimal inductive biases, which can prove limiting in small data settings (low floor), but this same flexibility enables Transformers to outperform CNNs in large data regimes (high ceiling).
Our work, to be presented this month at ICML 2021, asks a simple question: Is it possible to design models that benefit from inductive biases when they are helpful but are not limited by them when better solutions can be learned from data? In other words, can we get the best of both worlds? To do this, our ConViT model is initialized with a “soft” convolutional inductive bias, which the model can learn to ignore if necessary.
Our goal with ConViT was to modify vision Transformers to impose a soft convolutional inductive bias, which encourages the network to act convolutionally but, critically, allows the model to decide for itself whether it wants to remain convolutional. To impose this soft inductive bias, we introduce gated positional self-attention (GPSA), a form of positional self-attention in which the model learns a gating parameter, lambda, which controls the balance between standard content-based self-attention and the convolutionally initialized positional self-attention.
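The gating idea above can be sketched in a few lines of plain Python for a single query patch and a single attention head. This is an illustrative sketch, not our released implementation: the function names are hypothetical, the scores are made-up numbers, and it assumes the gating parameter is squashed through a sigmoid so that the mixing weight stays between 0 and 1, a detail this post does not spell out.

```python
import math


def softmax(scores):
    """Turn raw scores into attention weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_attention(content_scores, positional_scores, gate_param):
    """Blend content-based and positional attention for one query patch.

    gate_param plays the role of lambda; squashing it with a sigmoid
    is an assumption of this sketch.
    """
    gate = sigmoid(gate_param)
    content = softmax(content_scores)
    positional = softmax(positional_scores)
    return [(1.0 - gate) * c + gate * p for c, p in zip(content, positional)]


# Made-up scores over four patches for one query.
content_scores = [0.2, 1.5, -0.3, 0.9]
# Positional scores peaked on one patch, mimicking a convolution-like focus.
positional_scores = [4.0, 0.0, -4.0, -8.0]

mostly_positional = gated_attention(content_scores, positional_scores, 20.0)
mostly_content = gated_attention(content_scores, positional_scores, -20.0)
balanced = gated_attention(content_scores, positional_scores, 0.0)
```

With a large positive gating parameter the head behaves almost exactly like the positional (convolution-like) attention, and with a large negative one it falls back to ordinary content-based attention. Because the parameter is learned, each head can settle anywhere between the two.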
Equipped with GPSA layers, ConViT outperforms the recently proposed Data-efficient image Transformers (DeiT) model of equivalent size and FLOPs. For example, ConViT-S+ slightly outperforms DeiT-B in ImageNet top-1 accuracy (82.2 percent vs. 81.8 percent) while using a little more than half as many parameters (48M vs. 86M). However, the improvement of ConViT is most dramatic in limited data regimes, where the soft convolutional inductive bias plays a larger role. For example, when only 5 percent of the training data is used, ConViT dramatically outperforms DeiT (47.8 percent vs. 34.8 percent).
In addition to the performance advantages of ConViT, the gating parameter provides us with an easy way to understand the extent to which each layer remains convolutional after training. Across all layers, we found that ConViT relied progressively less on the convolutional positional attention over the course of training. For later layers, the gating parameter eventually converged close to 0, suggesting that the convolutional inductive bias is practically ignored. For early layers, however, many attention heads maintained high gating values, suggesting that the network uses the convolutional inductive bias in early layers to aid training.
The performance of AI models is enormously dependent on the type and amount of data with which they are trained. In research and even more so in real-world applications, we are often constrained in what data is available. We believe that ConViT — and more generally, the idea of imposing soft inductive biases that models can learn to ignore — is an important step forward in building more flexible AI systems that can perform well with whatever data they are provided. ConViT also helps us better understand how these models work by providing interpretable parameters, which can be leveraged to understand and debug these models. (Facebook AI is also exploring interpretability in other ways, such as with Captum, an open source library for model interpretability, our research on the functional role of easy-to-interpret neurons, and our research on “lottery ticket” initializations.)
We hope that our ConViT approach will inspire the community to explore other ways to move from hard inductive biases to soft inductive biases. To enable this, we have released our code on GitHub, and our models are now integrated into the popular libraries timm and VISSL.