Better computer vision models by combining Transformers and convolutional neural networks

July 8, 2021

What the research is:

We’ve developed a new computer vision model called ConViT, which combines two widely used AI architectures — convolutional neural networks (CNNs) and Transformer-based models — in order to overcome some important limitations of each approach on its own. By leveraging both techniques, this vision Transformer-based model can outperform existing architectures, especially in the low data regime, while achieving similar performance in the large data setting.

AI researchers often build a particular set of assumptions, commonly known as inductive biases, into new machine learning models and training paradigms because these assumptions can help models learn more generalizable solutions from less data. CNNs, which have proved extremely successful for vision tasks, rely on two of these inductive biases built into the architecture itself: that pixels near one another are related (locality) and that different portions of an image should be processed identically regardless of their absolute location (weight sharing).
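To make the two inductive biases concrete, here is a minimal sketch of a one-dimensional convolution in plain Python. It is an illustration only, not code from ConViT: the function name `conv1d`, the toy signal, and the kernel are all assumptions chosen for the example. Each output value depends only on a few neighboring inputs (locality), and the same kernel is reused at every position (weight sharing), so shifting the input simply shifts the output.

```python
def conv1d(signal, kernel):
    """Valid 1-D cross-correlation: slide one shared kernel over the signal."""
    k = len(kernel)
    return [
        # Locality: output i only looks at inputs i .. i + k - 1.
        # Weight sharing: the same kernel values are used for every i.
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


# Toy input (hypothetical values) and the same pattern moved one step right.
signal = [0, 0, 1, 2, 1, 0, 0, 0]
shifted = [0] + signal[:-1]
kernel = [1, 0, -1]

out = conv1d(signal, kernel)
out_shifted = conv1d(shifted, kernel)

# Away from the borders, the response to the shifted input is the
# original response moved by one position.
print(out)
print(out_shifted)
```

Because the kernel never changes with position, the model does not have to relearn the same pattern at every location, which is one reason these biases help when data is scarce.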

In contrast, self-attention-based vision models (like Data-efficient image Transformers and Detection Transformers) feature minimal inductive biases. When trained on large data sets, such models have matched and sometimes exceeded the performance of CNNs. But they often struggle to learn meaningful representations when trained on small data sets.

AI researchers are therefore presented with a trade-off: CNNs’ strong inductive biases enable them to reach high performance even with minimal data (high floor), yet these same inductive biases may limit these models when large quantities of data are present (low ceiling). In contrast, Transformers feature minimal inductive biases, which can prove limiting in small data settings (low floor), but this same flexibility enables Transformers to outperform CNNs in large data regimes (high ceiling).

Our work, to be presented this month at ICML 2021, asks a simple question: Is it possible to design models that benefit from inductive biases when they are helpful but are not limited by them when better solutions can be learned from data? In other words, can we get the best of both worlds? To do this, our ConViT model is initialized with a “soft” convolutional inductive bias, which the model can learn to ignore if necessary.

Soft inductive biases can help models learn without being restrictive. Hard inductive biases, such as the architectural constraints of CNNs, can greatly improve the sample-efficiency of learning but can become constraining when the size of the data set is not an issue. The soft inductive biases introduced by the ConViT avoid this limitation by vanishing when not required.

How it works:

Our goal with ConViT was to modify vision Transformers to impose a soft convolutional inductive bias, which encourages the network to act convolutionally but, critically, allows the model to decide for itself whether it wants to remain convolutional. To impose this soft inductive bias, we introduce gated positional self-attention (GPSA), a form of positional self-attention in which the model learns a gating parameter, lambda, which controls the balance between standard content-based self-attention and the convolutionally initialized positional self-attention.
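The gating idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the released implementation: it handles a single query and a single head, it assumes the two attention maps are each normalized with a softmax and then blended as a convex combination weighted by σ(λ), and the function names and scores are hypothetical. The paper has the exact formulation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_attention_row(content_scores, position_scores, lam):
    """Blend content-based and positional attention for one query.

    Assumed form: (1 - sigmoid(lam)) * content + sigmoid(lam) * positional.
    """
    gate = sigmoid(lam)
    content = softmax(content_scores)
    positional = softmax(position_scores)
    return [(1.0 - gate) * c + gate * p for c, p in zip(content, positional)]


# Hypothetical scores for one query patch attending to four patches.
content_scores = [2.0, 0.1, 0.1, 0.1]       # content favors patch 0
position_scores = [-9.0, -4.0, -1.0, 0.0]   # position favors nearby patch 3

# Large positive lambda: the gate is near 1, so position dominates.
mostly_positional = gated_attention_row(content_scores, position_scores, lam=4.0)
# Large negative lambda: the gate is near 0, so content dominates.
mostly_content = gated_attention_row(content_scores, position_scores, lam=-4.0)

print(mostly_positional)
print(mostly_content)
```

Because the blend is a convex combination of two normalized maps, each row still sums to 1, and a single learned scalar per head is enough to move smoothly from convolution-like behavior to ordinary self-attention.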

The ConViT (left) is a version of the ViT in which some of the self-attention (SA) layers are replaced with gated positional self-attention layers (GPSA; right). Because GPSA layers involve positional information, the class token is concatenated with the hidden representation after the last GPSA layer. FFN: feedforward network (two linear layers separated by a GeLU activation); W_qry: query weights; W_key: key weights; v_pos: attention center and span embeddings (learned); r_qk: relative position encodings (fixed); λ: gating parameter (learned); σ: sigmoid function.

Equipped with GPSA layers, ConViT outperforms the recently proposed Data-efficient image Transformers (DeiT) model of equivalent size and FLOPs. For example, the ConViT-S+ slightly outperforms DeiT-B in top-1 accuracy on ImageNet-1k (82.2 percent vs. 81.8 percent) while using a little more than half as many parameters (48M vs. 86M). The improvement is largest, however, in limited data regimes, where the soft convolutional inductive bias plays a larger role. For example, when only 5 percent of the training data is used, ConViT reaches 47.8 percent compared with 34.8 percent for DeiT.
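For readers who want the gaps in relative terms, the short calculation below uses only the figures quoted above. The helper name `relative_improvement` is ours, introduced for illustration.

```python
def relative_improvement(new, baseline):
    """Fractional gain of `new` over `baseline`."""
    return (new - baseline) / baseline


# Accuracies (percent) and parameter counts (millions) quoted in the text.
low_data_gain = relative_improvement(47.8, 34.8)    # 5 percent of training data
full_data_gain = relative_improvement(82.2, 81.8)   # ConViT-S+ vs. DeiT-B
param_ratio = 48 / 86                               # ConViT-S+ vs. DeiT-B size

print(f"low-data relative gain:  {low_data_gain:.1%}")
print(f"full-data relative gain: {full_data_gain:.1%}")
print(f"parameter ratio:         {param_ratio:.1%}")
```

The low-data gain works out to roughly 37 percent in relative terms, compared with about half a percent in the full-data comparison, where ConViT-S+ uses about 56 percent of the parameters of DeiT-B.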

ConViT outperforms the DeiT both in sample and parameter efficiency. Left: We compare the sample efficiency of our ConViT-S with that of the DeiT-S by training them on subsets of ImageNet-1k with identical hyperparameters. We display the relative improvement of the ConViT over the DeiT in green. Right: We compare the top-1 accuracies of our ConViT models with those of other ViTs (diamonds) and CNNs (squares) on ImageNet-1k. The performance of other models on ImageNet is taken from Touvron et al., 2020; He et al., 2016; Tan & Le, 2019; Wu et al., 2020; and Yuan et al., 2021.

In addition to the performance advantages of ConViT, the gating parameter provides us with an easy way to understand the extent to which each layer remains convolutional after training. Across all layers, we found that ConViT paid progressively less attention to the convolutional positional attention over the course of training. For later layers, the gating parameter eventually converged close to 0, suggesting that the convolutional inductive bias is practically ignored. For early layers, however, many attention heads maintain high gating values, suggesting that the network uses the convolutional inductive bias in early layers to aid training.

This graphic shows several example attention maps for DeiT (b) and ConViT (c). σ(λ) is the gate value obtained by applying the sigmoid to the learnable gating parameter λ. Values close to 1 indicate the convolutional initialization is being used, whereas values close to 0 indicate only content-based attention is being used. Note that early ConViT layers partially maintain the convolutional initialization, while later layers are purely content-based.

Why it matters:

The performance of AI models is enormously dependent on the type and amount of data with which they are trained. In research and even more so in real-world applications, we are often constrained in what data is available. We believe that ConViT — and more generally, the idea of imposing soft inductive biases that models can learn to ignore — is an important step forward in building more flexible AI systems that can perform well with whatever data they are provided. ConViT also helps us better understand how these models work by providing interpretable parameters, which can be leveraged to understand and debug these models. (Facebook AI is also exploring interpretability in other ways, such as with Captum, an open source library for model interpretability, our research on the functional role of easy-to-interpret neurons, and our research on “lottery ticket” initializations.)

We hope that our ConViT approach will inspire the community to explore other ways to move from hard inductive biases to soft inductive biases. To enable this, we have provided our code on GitHub, and our models are now integrated in the popular libraries Timm and VISSL.

Read the full paper