Audiovisual self-supervised representation learning

July 7, 2021

Humans don’t simply experience the world through multiple senses; we use these sensory inputs together to better understand what’s around us. If we hear a howl in the distance and turn to see a faraway wolf, we intuitively connect the noise and what we see. Likewise, if what we hear doesn’t seem to fit what we see — a tiny dog that suddenly roars like a lion — we are immediately surprised.

Facebook AI has leveraged the natural association between video and sound to teach machines to better understand the world. Like people, our techniques learn which sounds are likely to accompany different images. What’s more, the new methods described here use self-supervised learning — i.e., they rely only on the images and sounds in a particular video, rather than relying on human annotators to manually review the video and label the things they see and hear. These techniques can use videos and their accompanying audio to learn powerful features without using human labels. Our approach overcomes limitations of previous self-supervised systems and enables using self-supervised learning on noisy real-world data. We believe that our analysis applies more generally to most multimodal self-supervised learning and provides insights into the drawbacks of other contrastive learning techniques.

We are now sharing our code here so that others in the AI research community can build on our work and help us all advance the state of the art in this important field.

Our research has been accepted as two papers, AVID and RXID, at CVPR 2021. We are delighted AVID has been recognized as a best-paper candidate, while RXID has been accepted for an oral presentation.

Learning by contrasting both audio and video together

Our work builds upon standard contrastive learning methods, which learn representations by minimizing the distance in feature space between positive pairs of samples and maximizing the distance between negative pairs of samples. We train separate audio and visual models that create a shared representation space for the audio and video inputs. The models are trained using contrastive learning so that audio and visual features from the same clip (or instance) are close together in the feature space, while features from different instances are far away.
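As a rough illustration of the cross-modal objective described above, here is a minimal sketch in plain Python. It assumes an InfoNCE-style loss with in-batch negatives and a temperature of 0.07; these choices, and the function names, are assumptions made for illustration and are not taken from the released code.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cross_modal_nce(video_feats, audio_feats, temperature=0.07):
    """Average cross-modal contrastive loss over a batch (illustrative only).

    video_feats[i] and audio_feats[i] come from the same clip (a positive
    pair); every other pairing in the batch is treated as a negative.
    """
    v = [l2_normalize(x) for x in video_feats]
    a = [l2_normalize(x) for x in audio_feats]
    n = len(v)
    total = 0.0
    for i in range(n):
        # Video-to-audio: find the matching audio among all audio clips.
        logits_va = [dot(v[i], a[j]) / temperature for j in range(n)]
        # Audio-to-video: find the matching video among all video clips.
        logits_av = [dot(a[i], v[j]) / temperature for j in range(n)]
        for logits in (logits_va, logits_av):
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]  # negative log-probability of the positive
    return total / (2 * n)
```

The loss is small when each clip's audio and video features point in the same direction and large when they are mismatched, which is the behavior the paragraph above describes.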

We first applied contrastive learning to develop audiovisual instance discrimination (AVID), a simple and easy-to-implement way to learn state-of-the-art representations for downstream tasks like action recognition and environmental sound classification.

AVID can work surprisingly well for video and audio recognition tasks. We find that by using the self-supervised signal from both video and audio, AVID’s video features are excellent at video tasks such as action recognition, and achieve state-of-the-art results on standard benchmarks like UCF-101 and HMDB-51 while being conceptually simpler than prior work.

We then improve AVID in two ways. First, cross-modal agreement (CMA) lets the model look at additional related video and audio clips during training. Second, a new method called robust cross-modal instance discrimination (RXID) makes training more robust to noisy examples in which the audio offers no useful information about the contents of the video, or vice versa.

Interestingly, we found the cross-modal objective plays a critical role in effective representation learning, outperforming equivalent monomodal self-supervised frameworks that use either audio or video (but not both). Since the models need to find associations between inputs of very different natures, the cross-modal instance discrimination promotes learning of high-level semantic features. This reinforces our view that multiple modalities, if available, can enhance instance discrimination methods.

Furthermore, by seeking to associate audio with the corresponding visual clips, AVID must identify which objects in the video produce the resulting audio. This means the model naturally focuses its attention on the specific regions that correspond to a sound. This is analogous to how people might fix their gaze on a passing fire truck when they hear the siren.


The AVID model predicted which regions of the video input contributed the most to match the corresponding audio signals.

Beyond a good initialization for further fine-tuning, the feature space obtained with AVID already encodes semantic structures that align well with human judgments.

Beyond instance discrimination

While instance discrimination can achieve outstanding performance, it suffers from two drawbacks. The first is that it uses different instances as negatives, even when the subject matter is related. (For example, the model knows to associate a black-and-white and color version of the same dog photograph, but it doesn’t know that two different dog photos should be grouped together.) This is problematic, since these samples oppose the goal of representation learning: obtaining a semantic space where similar videos are close to each other. With CMA, we mitigate the influence of similar negatives by extending the set of positives to other similar instances. However, we found that grouping instances based solely on visual or audio appearance often fails. Instead, we rely on instances that agree in both modalities to provide more accurate positive sets.
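The idea of requiring agreement in both modalities can be sketched as follows. This sketch assumes that agreement is scored as the minimum of the video-to-video and audio-to-audio cosine similarities, and that the top k instances are taken as extra positives; the scoring rule, the value of k, and the function names are illustrative assumptions.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def agreement_positives(video_feats, audio_feats, anchor, k=1):
    """Indices of the k instances that agree most with `anchor` in BOTH modalities."""
    scores = []
    for j in range(len(video_feats)):
        if j == anchor:
            continue
        sim_v = cosine(video_feats[anchor], video_feats[j])
        sim_a = cosine(audio_feats[anchor], audio_feats[j])
        # Taking the minimum means an instance scores high only if it is
        # similar to the anchor in video AND in audio.
        scores.append((min(sim_v, sim_a), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]
```

With this rule, an instance that looks like the anchor but sounds different is not added to the positive set, which matches the observation that grouping on a single modality often fails.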

RXID is an online way to mitigate the influence of similar negatives. Unlike CMA, RXID evaluates inter-instance similarity at each iteration in an online fashion, by optimizing an instance discrimination loss with a soft target distribution over instances. This makes RXID simpler to implement than CMA.
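A soft target distribution over instances can be sketched as below. The sketch assumes the target is a blend of the usual one-hot instance label and a softmax over inter-instance similarities, controlled by a hypothetical `mix` parameter; the exact form of the targets used by RXID is not given in this post, so treat this as one plausible reading.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def soft_targets(index, similarities, mix=0.5):
    """Blend the one-hot label for `index` with a similarity-based distribution.

    mix=0 recovers standard instance discrimination (one-hot targets).
    """
    sim_dist = softmax(similarities)
    return [(1 - mix) * (1.0 if j == index else 0.0) + mix * p
            for j, p in enumerate(sim_dist)]

def soft_cross_entropy(logits, targets):
    """Cross-entropy between a soft target distribution and the model's softmax."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(targets, probs))
```

Because the targets are recomputed from the current similarities at every step, no separate pass to build positive sets is needed, which is the sense in which the approach is online.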

The second drawback is that audiovisual associations can sometimes be too weak to enable instance discrimination. For example, imagine a video of a person doing yoga in silence. We call these instances faulty positives. In these cases, establishing correspondences between audio and video clips is impossible, and forcing these correspondences can be detrimental to the learned representations. RXID addresses this problem by leveraging cross-modal similarities to identify likely faulty positives, and down-weighting their contribution to the final loss.
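Down-weighting likely faulty positives can be sketched as follows. The sketch assumes each clip's weight is a normal CDF applied to its standardized audio-video similarity, with hypothetical `shift` and `scale` parameters; the weighting function and its defaults are assumptions chosen for illustration.

```python
import math

def positive_weights(pair_sims, shift=-1.0, scale=0.5):
    """Map each clip's audio-video similarity to a weight between 0 and 1.

    Clips whose similarity falls well below the batch average are treated
    as likely faulty positives and receive a weight close to 0.
    """
    n = len(pair_sims)
    mean = sum(pair_sims) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in pair_sims) / n) or 1.0
    weights = []
    for s in pair_sims:
        z = (s - (mean + shift * std)) / (scale * std)
        weights.append(0.5 * (1.0 + math.erf(z / math.sqrt(2))))  # normal CDF
    return weights

def weighted_loss(per_instance_losses, weights):
    """Weighted average of per-instance losses."""
    return sum(w * l for w, l in zip(weights, per_instance_losses)) / sum(weights)
```

In the yoga example above, the silent clip would have low audio-video similarity, so its loss term would contribute little to the final average.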

With both faulty positives and faulty negatives addressed, the cross-modal instance discrimination objective is less susceptible to these two major sources of noisy training signal, which can lead to significant improvements in the learned representations.

AVID, CMA, and RXID pave the way for building more accurate representations of audio and video while avoiding the cost of manually labeling large amounts of data. We hope our work encourages the community to explore self-supervised learning from multiple modalities.

Written By

Pedro Morgado

Research Intern

Research Scientist