December 13, 2022
Many recent breakthroughs in AI have been powered by self-supervised learning, which enables machines to learn without relying on labeled data. But current algorithms have several significant limitations, often including being specialized for a single modality (such as images or text) and requiring lots of computational power. This contrasts with human learning: People appear to learn much more efficiently than current AI, and also learn from different kinds of information in a similar way, rather than relying on separate learning mechanisms for text, speech, and other modalities.
Meta AI addressed one of these limitations earlier this year when we released data2vec, the first high-performance self-supervised algorithm to learn the same way for three different modalities: speech, vision, and text. Data2vec made it much easier to apply research advances in, say, text understanding to an image segmentation or speech translation task.
Today, we’re sharing data2vec 2.0, a new algorithm that is vastly more efficient and outperforms its predecessor’s strong performance. It achieves the same accuracy as the most popular existing self-supervised algorithm for computer vision but does so 16x faster.
To make our research accessible to other researchers, we are now sharing the code and pretrained models.
The general idea of self-supervised learning is for machines to learn the structure of images, speech, and text simply by observing the world. Advances in this area have led to many breakthroughs in speech (e.g., wav2vec 2.0), computer vision (e.g., masked autoencoders), and natural language processing (e.g., BERT). But modern systems can be computationally demanding, as training very large models requires many GPUs.
Similar to the original data2vec algorithm, data2vec 2.0 predicts contextualized representations of the data — or the layers of a neural network — instead of the pixels of an image, the words of a text passage, or the sounds of speech. Unlike with most other algorithms, these so-called target representations are contextualized, meaning they take the entire training example into account. For instance, the representation of the word bank is based on the entire sentence that the word appears in, and it is therefore easier to represent the correct meaning of the word (“financial institution” or “ground next to river”). We believe that contextualized targets lead to a richer learning task and enable data2vec 2.0 to learn faster than other algorithms.
We improved the efficiency of the original data2vec algorithm in several ways: First, we take target representations built for a particular training example and reuse them for masked versions (where we hide different random parts of the training example). We take each version and feed it into the student model, which predicts the same contextualized target representation for the different masked versions. This effectively amortizes the computational effort required to create target representations. Second, and similar to masked autoencoders, we do not run the student encoder network for the parts of the training examples that are blanked out (which is about 80 percent of an image in our case), thereby saving significant compute cycles. Finally, we use a more efficient decoder model that relies not on Transformer networks but on a multilayer convolutional network.
To get a better understanding of how much more efficient data2vec 2.0 is than its predecessor and other algorithms, we tested it on computer vision, speech, and text tasks on widely used benchmarks. We were looking at the final accuracy and the time it took to pretrain the model. We measured the speed of algorithms on the same hardware (number of GPUs, etc.).
For computer vision, we evaluated data2vec 2.0 on the standard ImageNet-1K image classification benchmark, where it learned to represent images. Data2vec 2.0 can equal the accuracy of masked autoencoders (MAE) but is 16x faster (measured in wall clock time in a like-for-like setting). If we give the algorithm more time, it can achieve even higher accuracy while still being faster than MAE.
For speech, we tested it on the LibriSpeech speech recognition benchmark, where it performed more than 11 times faster than wav2vec 2.0 with similar accuracy. For natural language processing (NLP), we evaluated data2vec 2.0 on the popular General Language Understanding Evaluation (GLUE) benchmark, where it achieved the same accuracy as RoBERTa, a reimplementation of BERT, in half the training time.
We are on a journey to build more general and efficient self-supervised algorithms that can use a single learning objective to learn from different modalities. The ability to learn more efficiently is particularly important for modalities such as video, which require a lot of computational effort to process. We hope that more efficient self-supervised learning algorithms such as data2vec 2.0 will lead to machines that can deeply understand extremely complex data, such as the contents of an entire movie.
Access the open source code and pretrained models here, and read the paper hear.
Efficient self-supervised learning with contextualized target representations for speech, vision, and language
This blog post was made possible by the work of Alexei Baevski, Arun Babu, Wei-Ning Hsu, and Michael Auli.
The second graphic in this post has been updated to correct a typographical error.