A state-of-the-art, self-supervised framework for video understanding

June 25, 2020

What the research is

One of the hallmarks of human cognition is our ability to learn from the world around us without explicit training. Children, for example, learn to speak without depending on dictionaries or pronunciation guidelines. As an important milestone in our work toward emulating this ability in AI systems, we are sharing a new framework called Generalized Data Transformations. It achieves unprecedented performance in understanding the content of videos — without using labeled training data.

Generalized Data Transformations give us a systematic way of robustly learning the relationship between audio and visual information in order to learn about the structure of the world. This enables us to achieve record-breaking performance when we fine-tune the model for specific downstream tasks. The technique sets a new state of the art for video action recognition, retrieval, and few-shot learning, as well as for audio classification.

Because it is self-supervised and does not rely on labeled examples, this approach greatly enhances our ability to build AI systems that can learn about the world from large numbers of videos, not just the small subset that have been manually labeled by human annotators.

How it works

Much previous research on self-supervised learning has focused on defining informative surrogate (or pretext) tasks for training neural networks. These tasks use some inherent quality of the data as a supervision signal. For example, a system might use the color in an image as supervision and learn to colorize black-and-white pictures. By learning the appropriate colors for each object, the system implicitly learns object features that will also be relevant for other tasks. Similarly, researchers might use future segments of a video to train a system to predict the pixels in the next frame.

Noise-contrastive techniques train systems by applying two kinds of changes to an input such as a video: semantically meaningful changes (such as swapping in an image of a cat in place of a dog) and nuisance changes that modify the input but do not change its semantics (such as cropping the video or adding a filter). The technique then constrains the learned representation to be invariant to the nuisance transformations and sensitive to the semantically meaningful ones. This leads to robust and effective learning; however, previous applications of the technique have generally been effective only with a limited family of pretext tasks.

In our work, we introduce a framework that greatly extends the expressivity of noise-contrastive formulations for pretext tasks, and we demonstrate its power by learning from multimodal data. Our research focuses on cross-modal supervision: Rather than using a single modality, we learn from the relationships between the sound and images in a video. In our framework, a hierarchical sampling scheme is used to create a batch of data transformations (as illustrated in image A, below). A contrast matrix is then defined to specify, for each pair of data transformations, whether the model should learn invariance or distinctiveness (as seen in image B, below). Lastly, we use the noise-contrastive loss to learn a joint audio-visual embedding space where transformation pairs are either pulled together or pushed apart.

For audio-visual representation learning, we want representations that are invariant to modality but distinctive to sample. To make this operational, we use convolutional neural networks to encode both audio and image clips into high-dimensional vectors. The parameters of the encoders are optimized so that the representations of co-temporal audio and visual clips are near each other in vector space (as illustrated in image C, below). Conversely, the encodings of audio and visual clips that have nothing to do with each other should be far apart.

Image A: a hierarchical sampling of generalized data transformations T = t_M ∘ ... ∘ t_1 for multimodal training.

Image B: a subset of the contrast matrix c(T, T'), which indicates which transformation pairs repel (0) and which attract (1).
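To make the sampling scheme and the contrast matrix more concrete, here is a minimal sketch in plain Python. It is not the implementation from the paper: the `Transformation` fields, the helper names, and the three-level hierarchy (video, then modality, then a nuisance augmentation) are assumptions chosen for illustration, and the contrast rule encodes only the "invariant to modality, distinctive to sample" case described above.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Transformation:
    """One composed transformation T (field names are illustrative assumptions)."""
    sample: int      # first level: which video is selected from the dataset
    modality: str    # second level: "audio" or "visual" slice of that video
    aug_seed: int    # third level: seed for a nuisance augmentation such as a crop


def sample_batch(num_videos, batch_videos, rng):
    """Sample hierarchically: pick distinct videos, then both modalities of each,
    then an independent augmentation seed for every clip."""
    videos = rng.sample(range(num_videos), batch_videos)
    return [
        Transformation(video, modality, rng.randrange(10_000))
        for video in videos
        for modality in ("audio", "visual")
    ]


def contrast(t_a, t_b):
    """c(T, T'): 1 means attract, 0 means repel.

    Assumed rule: clips from the same video attract regardless of modality
    or augmentation; clips from different videos repel.
    """
    return 1 if t_a.sample == t_b.sample else 0


def contrast_matrix(batch):
    """Evaluate c(T, T') for every pair of transformations in the batch."""
    return [[contrast(a, b) for b in batch] for a in batch]


rng = random.Random(0)
batch = sample_batch(num_videos=100, batch_videos=3, rng=rng)
matrix = contrast_matrix(batch)
for row in matrix:
    print(row)
```

Because each video contributes an adjacent audio clip and visual clip, the printed matrix is block diagonal with 2×2 blocks of ones. Other choices of `contrast`, for example repelling clips taken from different time offsets of the same video, would give a different pretext task within the same structure.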

Image C: how the hierarchical family of transformations that we explore is mapped into the joint audio-visual embedding space using a cross-modal NCE objective function. See the paper for full details.
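The pull-together, push-apart behavior of a cross-modal noise-contrastive objective can be sketched as follows. This is a simplified, one-directional (video to audio) InfoNCE-style loss written in plain Python, not the exact objective from the paper. The function names and the `temperature` value are assumptions, and the embeddings are hand-written stand-ins for the outputs of the audio and visual encoders.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so that dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_modal_nce_loss(video_emb, audio_emb, temperature=0.1):
    """Average contrastive loss over a batch of co-temporal (video, audio) pairs.

    video_emb[i] and audio_emb[i] come from the same clip and should attract;
    every other audio clip in the batch acts as a negative and should repel.
    """
    video_emb = [normalize(v) for v in video_emb]
    audio_emb = [normalize(a) for a in audio_emb]
    losses = []
    for i, v in enumerate(video_emb):
        logits = [dot(v, a) / temperature for a in audio_emb]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of picking the matching audio clip.
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)


video = [[1.0, 0.0], [0.0, 1.0]]
audio = [[0.9, 0.1], [0.1, 0.9]]

aligned = cross_modal_nce_loss(video, audio)          # matching pairs are close
shuffled = cross_modal_nce_loss(video, audio[::-1])   # matching pairs are far apart
print(aligned < shuffled)
```

The loss is small when each video embedding is closest to its own audio embedding and large when the pairing is scrambled, which is the gradient signal that drives co-temporal audio and visual clips toward each other in the shared space.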

We demonstrate that much past self-supervised and cross-modal research can be reduced to specific instantiations of the comprehensive Generalized Data Transformations framework.

Why it matters

To build truly intelligent machines, we must enable them to learn directly from the world without needing explicit guidance every step of the way. Being able to learn from the sights and sounds in the world as they occur, without explicit supervision, is the hallmark of this kind of learning. In contrast, the field of artificial intelligence has, since its inception, relied on people labeling massive amounts of data. At Facebook, and throughout the AI research community, the replacement of data-limited supervised learning with unlimited self-supervised learning is considered perhaps the most important frontier of artificial intelligence. Our work advances the state of the art in this area and is broadly applicable across many areas of research and many applications. These range from detecting inappropriate content such as hate speech or harassment in videos to making more accurately personalized video recommendations and creating better user experiences in virtual reality.

Read the full paper:
https://arxiv.org/pdf/2003.04298.pdf