Advancing self-supervision, CV, NLP to keep our platforms safe

May 01, 2019

Written by Michael Auli, Matt Feiszli, Deepti Ghadiyaram, Alex Kirillov, Holger Schwenk, Ves Stoyanov, Du Tran, Manohar Paluri


We use AI in a wide range of applications at Facebook today — and one of the most important is helping keep people safe on our platforms. In order to make all these systems more effective, we need to continue to improve our AI in two areas in particular: understanding content and working effectively with less labeled training data.

Our recent advances in natural language processing (NLP) and computer vision (CV) show how work in content understanding is producing benefits. In NLP, we've developed a shared multilingual embedding space that can act as a sort of lingua franca to help take action on harmful content even in low-resource languages. In CV, we've built on our industry-leading research to identify content in more parts of an image and achieve record-setting accuracy using hashtags for video understanding.

As our ability to understand content continues to improve across modalities, we've also made progress in the new frontier of self-supervision. This technique will accelerate learning by pretraining systems, and it can be the foundation for the next generation of quicker, nimbler tools.

We will highlight here how we’re improving the accuracy and efficiency of our content understanding systems and finding new ways to do more with less supervised learning.

Using multilingual sentence embeddings to tackle harmful content

To detect when people post something that violates our policies, our systems need to understand language. Specifically, our systems use machine learning (ML) to scan a given sentence and answer a range of questions, such as “Is it hateful?” or “Is it bullying someone?” Using answers from questions like these — along with the context of the interaction and other signals — we can determine whether to take action, such as flagging to a human reviewer.

For our ML systems to answer these questions, we need to train them with thousands of examples in a given language. And with approximately 6,500 spoken languages in the world, including ones that currently lack large training data sets, it is a challenge to find enough examples to develop content understanding in all the languages we support.

By mapping similar sentences in multiple languages in a shared embedding space, we can better understand relevant content — including policy violations — without translating each sentence.

To help offset this scarcity of training data, we’re leveraging our recently open-sourced toolkit LASER (Language-Agnostic SEntence Representations), which trains a single model to understand scores of languages. Where previously we’ve needed to use a different model for each language, LASER’s representation space allows us to train in one language, and then apply the model to a range of languages without requiring language-specific training data, and without having to translate them, which is referred to as “zero-shot transfer learning.” LASER also lets us identify sentences that are similar in meaning by mapping those sentences closer to each other within a language-agnostic representation space.
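The zero-shot idea described above can be illustrated with a minimal sketch. It assumes sentences from every language have already been mapped into one shared space by some encoder. The three-dimensional vectors, the label names, and the nearest-centroid classifier are all hypothetical stand-ins chosen for brevity; they are not LASER output or the production classifier.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def centroid(vectors):
    """Element-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def train_centroids(labeled_embeddings):
    """Build one centroid per label from (embedding, label) pairs in ONE language."""
    by_label = {}
    for emb, label in labeled_embeddings:
        by_label.setdefault(label, []).append(emb)
    return {label: centroid(vs) for label, vs in by_label.items()}

def classify(embedding, centroids):
    """Return the label whose centroid is most similar to the embedding."""
    return max(centroids, key=lambda label: cosine(embedding, centroids[label]))

# Hypothetical embeddings of labeled English sentences (toy values).
english_train = [
    ([0.9, 0.1, 0.0], "violating"),
    ([0.8, 0.2, 0.1], "violating"),
    ([0.1, 0.9, 0.2], "benign"),
    ([0.0, 0.8, 0.3], "benign"),
]
centroids = train_centroids(english_train)

# Hypothetical embedding of a sentence in a different language.
# No labeled data from that language was used to build the centroids.
other_language_sentence = [0.85, 0.15, 0.05]
prediction = classify(other_language_sentence, centroids)
print(prediction)
```

Because the classifier only ever sees vectors, it does not need to know which language a sentence came from; the quality of the transfer depends entirely on how language-agnostic the embedding space is.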

For researchers looking to increase the number of languages their systems can understand, cross-lingual techniques like this offer a more scalable alternative to trying to gather and annotate data in every language. This approach also allows us to mine parallel training data for machine translation, and is especially useful for low-resource languages (for which we have fewer training examples). Identifying similar sentences across languages can help catch similar violations in multiple languages, all at the same time. To generate each sentence-level embedding, we first represent the words of a given sentence using byte-pair encoding, then use a five-layer bidirectional LSTM (long short-term memory) model, followed by max pooling (because sentences contain an arbitrary number of words).
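The role of max pooling in the last step can be shown with a small sketch. It assumes the encoder has already produced one state vector per token; the two-dimensional values below are hypothetical and stand in for the BiLSTM outputs, which are far larger in practice.

```python
def max_pool(token_states):
    """Element-wise maximum over per-token vectors.

    The result has the same dimensionality regardless of how many
    tokens the sentence contains.
    """
    return [max(dim) for dim in zip(*token_states)]

# Hypothetical per-token states for a 2-token and a 4-token sentence.
short_sentence = [[0.1, 0.7], [0.4, 0.2]]
long_sentence = [[0.1, 0.7], [0.4, 0.2], [0.3, 0.9], [0.0, 0.1]]

short_embedding = max_pool(short_sentence)
long_embedding = max_pool(long_sentence)
print(len(short_embedding), len(long_embedding))
```

Both sentences end up as vectors of the same size, which is what makes them directly comparable in a single embedding space.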

By training this system at a massive scale — for 93 languages, belonging to more than 30 language families and written in 22 different scripts — we’re able to obtain sentence embeddings that are language-agnostic and whose ability to support automatic detection of policy violations is especially relevant for low-resource languages.

This approach, together with our cross-lingual pretraining work, will improve our ability to tackle hate speech, bullying, and other violations in multiple languages without requiring additional in-language labeled training data. Both techniques will bolster our existing use of multilingual word embeddings, which map similar words from different languages into the same space (as opposed to LASER’s sentence-level maps). These embeddings are already deployed into production for a wide range of cross-lingual understanding tasks, including identifying content violations.

Advancing the state of the art in photo and video understanding

People share billions of photos on our platforms, and understanding the context of what’s in them is important for keeping people safe. And even though a straightforward analysis of pixels might be enough for our system to recognize individual objects in a picture, we push our industry-leading CV capabilities even further and teach systems to understand when the relationship between those objects represents a policy violation.

Our systems have excelled at identifying items in the foreground of a photo, such as a dog or a ball. But until recently they struggled to understand the larger, less-contained collections of pixels that constitute the photo’s background. Using a new approach to object recognition, called a panoptic feature pyramid network (Panoptic FPN), we can perform instance segmentation tasks (for the foreground) and semantic segmentation tasks (for the background) at the same time, on a single, unified neural architecture.
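To make the two kinds of output concrete, here is a minimal sketch of how a per-pixel background labeling and a set of foreground instance masks can be combined into one panoptic result. It illustrates only the combined output format, not the Panoptic FPN network itself; the grid, class names, and the `merge_panoptic` helper are hypothetical.

```python
def merge_panoptic(semantic, instances):
    """Combine semantic and instance predictions into one grid.

    semantic:  2D list of background class names, one per pixel.
    instances: list of (class_name, set of (row, col) pixels).
    Returns a 2D list of (class_name, instance_id) tuples, where
    instance_id is None for background pixels.
    """
    out = [[(label, None) for label in row] for row in semantic]
    for inst_id, (cls, pixels) in enumerate(instances):
        for r, c in pixels:
            # Later instances overwrite earlier ones where masks overlap.
            out[r][c] = (cls, inst_id)
    return out

# Hypothetical 2x2 image: sky on top, grass below, a dog and a ball on the grass.
semantic = [["sky", "sky"], ["grass", "grass"]]
instances = [("dog", {(1, 0)}), ("ball", {(1, 1)})]
panoptic = merge_panoptic(semantic, instances)
print(panoptic)
```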

Our CV systems have recognized progressively more image components over the years and can now perform detection of objects in both the foreground and the background with a single network. This results in better understanding of a photo’s overall context, as well as more computationally efficient image recognition.

Our results show that a Panoptic FPN can almost halve the overall computation needed to perform instance and semantic segmentation, compared with networks that do only one or the other. In practice, this gives the system greater contextual understanding of an image, which is important when deciding whether it violates our policies. But this work could affect other applications, too, such as potentially improving the automatic alt text that we use to describe images to the visually impaired.

Finding policy violations within video is orders of magnitude harder than in photos. Making sense of video means accounting for the large number of images that make up a given sequence of frames and the motion represented in that sequence, while also processing nonvisual input, such as audio.

Because of this difficulty, video understanding is still in its infancy. We’re consistently pushing the state of the art, both in terms of accuracy and efficiency, in part by focusing our system’s attention and training on the most relevant data. For example, by factorizing our 3D convolutions into separate 2D and 1D convolutions (related to space and time, respectively, in a given video sequence), we’ve reduced the number of trainable parameters. Alternatively, we can keep the same number of parameters and improve accuracy. Using this framework, we can find the balance between accuracy and efficiency.
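The parameter trade-off can be worked through with simple arithmetic. The sketch below counts weights (ignoring biases) for a full 3D kernel and for a spatial 2D convolution followed by a temporal 1D convolution. The kernel sizes, channel counts, and the intermediate channel count `mid` are assumptions chosen for illustration, not values from the production models.

```python
def conv3d_params(t, d, c_in, c_out):
    """Weights in a full t x d x d 3D convolution (biases ignored)."""
    return t * d * d * c_in * c_out

def factorized_params(t, d, c_in, c_out, mid):
    """Weights in a 1 x d x d spatial conv (c_in -> mid channels)
    followed by a t x 1 x 1 temporal conv (mid -> c_out channels)."""
    spatial = d * d * c_in * mid
    temporal = t * mid * c_out
    return spatial + temporal

def matching_mid_channels(t, d, c_in, c_out):
    """Largest intermediate channel count that does not exceed the
    parameter budget of the full 3D convolution."""
    return (t * d * d * c_in * c_out) // (d * d * c_in + t * c_out)

full = conv3d_params(3, 3, 64, 64)
smaller = factorized_params(3, 3, 64, 64, mid=64)
mid = matching_mid_channels(3, 3, 64, 64)
matched = factorized_params(3, 3, 64, 64, mid=mid)
print(full, smaller, mid, matched)
```

With a small intermediate width the factorized block has fewer weights than the 3D kernel; widening it until the budgets match gives the "same number of parameters" option described above.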

Rather than passing every frame in a given video through a spatiotemporal convolutional neural network, our saliency sampler approach isolates clips containing notable actions for further processing.

To understand what is happening in a video, we break it into short clips (each consisting of a small number of consecutive frames) and send each clip through our latest spatiotemporal model. We can then aggregate the per-clip results to get predictions for the whole video.
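A minimal sketch of this clip-then-aggregate flow follows. It assumes non-overlapping clips and simple averaging of per-clip class scores; both choices, along with the toy scores, are illustrative assumptions rather than the aggregation actually used in production.

```python
def split_into_clips(frames, clip_len):
    """Split a frame sequence into non-overlapping clips of consecutive frames.

    A trailing remainder shorter than clip_len is dropped.
    """
    return [frames[i:i + clip_len]
            for i in range(0, len(frames) - clip_len + 1, clip_len)]

def aggregate(clip_scores):
    """Average per-class scores across clips to get a video-level prediction."""
    n = len(clip_scores)
    return [sum(col) / n for col in zip(*clip_scores)]

# Frames are represented by their indices here.
clips = split_into_clips(list(range(10)), clip_len=4)

# Hypothetical per-clip scores for two classes, standing in for model output.
video_scores = aggregate([[0.25, 0.75], [0.75, 0.25]])
print(clips, video_scores)
```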

In many videos, however, only a few clips have information that’s salient to a specific task, such as detecting bullying, and the rest are either redundant or irrelevant. So, to further improve both our speed and efficiency in spotting actionable events in video, we built a saliency sampler. This system is trained to focus on sections that contain specific behaviors, and then process only those sets of frames in more detail. This more focused analysis and training has led to faster and more accurate video understanding.
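The selection step can be sketched as follows. In the system described above the sampler is trained; here the saliency scores are simply given as hypothetical inputs, and the `select_salient_clips` helper keeps the top-scoring clips for the more expensive model.

```python
def select_salient_clips(saliency_scores, k):
    """Return the indices of the k highest-scoring clips, in temporal order."""
    ranked = sorted(range(len(saliency_scores)),
                    key=lambda i: saliency_scores[i],
                    reverse=True)
    return sorted(ranked[:k])

# Hypothetical saliency scores for five clips of one video.
scores = [0.1, 0.9, 0.05, 0.7, 0.2]
selected = select_salient_clips(scores, k=2)
print(selected)
```

Only the selected clips would then be passed to the full spatiotemporal model, so the cost of the heavy model scales with `k` rather than with the length of the video.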

Record-setting accuracy using hashtags for video understanding

We’ve also developed a different approach that sets a new state of the art for recognizing actions, including ones that indicate content violations.

This technique builds directly on work we announced at F8 last year, which trained networks using billions of public images with hashtags and was able to beat the state of the art in image recognition tasks. In our new approach, hashtagged videos functioned as weakly supervised data, meaning training examples whose labels had been applied by people, but without the precision of full supervision.

The resulting annotations were noisy and imprecise compared with labels applied specifically for the purpose of training AI models. But the number of labeled examples that this approach provided showed us that we could significantly improve video understanding by training not only on weakly supervised training data, but also on an unprecedented amount of it.

In this case, the largest data set that we trained on consisted of more than 65 million public Instagram videos with hashtags. In comparison, current action classification data sets only consist of a few hundred thousand videos. Using these videos imposed technical challenges that were similar to our billion-scale image recognition work, such as having to distribute training across hardware, as well as new hurdles, including dealing with the fact that hashtags often apply to only a small portion of a video — a clip tagged with #wedding and #dance might feature only a few seconds of a newly married couple dancing, within a much longer video.

Despite this temporal noise issue, we found that the diversity of content and sheer scale of examples offset the label noise. And by utilizing our saliency sampler, our video recognition model has achieved state-of-the-art accuracy on three major video classification benchmarks. That includes reaching 82.8 percent accuracy on the Kinetics data set when classifying videos into one of 400 different human action categories. This is a 5.1 percentage point improvement over the previous state of the art’s 77.7 percent accuracy, representing a relative reduction in errors of more than 25 percent. We have applied this approach to our systems in production, improving our detection rates for bullying by almost 85 percent.

And by incorporating audio into this model, we’re able to get even better results. Our experiments have demonstrated that, compared with visual models using the same architecture and training process, our joint audio-video model set a new state of the art on the AudioSet audio event detection benchmark — and delivered a 20 percent improvement in accuracy for detecting profanity and adult content.

The self-supervised future of content understanding

These advances in language, image, and video understanding are part of an ongoing effort to improve our policy enforcement. But as we look to the long-term mission of keeping our platform safe, it will be increasingly important to create systems that can be trained using large amounts of unlabeled data.

The majority of our systems today rely on supervised training. This can lead to a range of training challenges, such as a scarcity of training data in some cases, and long training times as we gather and label examples to build new classifiers from scratch. Since new instances of content violations evolve quickly, and events such as elections have become flashpoints for harmful content, we have a responsibility to speed the development of systems that can improve our ability to respond.

One potential answer is an approach that Facebook Chief AI Scientist, Yann LeCun, has been discussing for years: self-supervision. Instead of relying solely on data that’s been labeled for training purposes by humans — or even on weakly supervised data, such as images and videos with public hashtags — self-supervision lets us take advantage of entirely unlabeled data. The approach is inherently versatile, enabling self-supervised systems to use a small amount of labeled data to generalize to unseen tasks, and potentially bringing us closer to our goal of achieving AI with human-level intelligence.

What was once essentially a research strategy for our AI teams has recently transitioned to systems that are delivering strong internal results, with some self-supervised language understanding models consistently beating systems that were trained using traditional, fully supervised methods. Specifically, we’ve developed models that learn to predict one part of a given signal by training on the rest of that signal.

For example, we trained one of these self-supervised systems to better understand language by masking words in sentences and having it predict them, even when the model has never seen that exact sentence before. Given a phrase like “A conversation about ________ and human connection,” people can easily guess several words that would fill the gap. But this task is more challenging for AI. This is the foundation for a useful and scalable training task, similar to the task solved by the BERT model that Google introduced concurrently. We can blank out each word of a sentence in turn and repeat the process for a billion words, with no labeling required.
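Generating training examples by blanking out each word in turn takes only a few lines. In this sketch the `[MASK]` placeholder, the whitespace tokenization, and the example sentence are assumptions made for illustration; a real system would use its own tokenizer and vocabulary.

```python
MASK = "[MASK]"  # hypothetical placeholder token

def masked_examples(sentence):
    """Yield one (masked sentence, target word) pair per word.

    The target comes from the text itself, so no human labeling is needed.
    """
    words = sentence.split()
    examples = []
    for i, target in enumerate(words):
        masked = words[:i] + [MASK] + words[i + 1:]
        examples.append((" ".join(masked), target))
    return examples

examples = masked_examples("the dog chased the ball")
for masked, target in examples:
    print(masked, "->", target)
```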

By separately analyzing the context of a sentence to the left and to the right of a masked word, our bidirectional transformer model is able to predict the missing word without relying on labeled data.

To predict each hidden word, we use bidirectional transformer networks that model the rest of the sentence by computing the forward and backward states of the sentence — the words to the right and to the left of the mask — and then combine those representations to determine the center word. Once the system has been trained in this unlabeled manner, we can then use labeled data to fine-tune it for a specific task, such as identifying hate speech. In internal tests, this blending of self-supervised and supervised training allowed us to train on 10x less data than fully supervised models for similar accuracy, or achieve a 20 percent error reduction using the same amount of data.

We’re also using self-supervised training to improve speech recognition. We created several versions of an audio clip in which a section has been changed, and the model must determine which version is correct using only raw audio as input, with no transcriptions or other labels.

For this method, we use two networks stacked on top of each other: an encoder network that maps raw audio to a feature representation at a lower temporal frequency, and a context network that predicts the correct audio. To make the task more effective for training, we make the prediction problem increasingly difficult by asking the context network to predict further and further into the future.
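The "further and further into the future" schedule can be sketched by enumerating, for each context position and each step size, which encoded frame the context network would be asked to identify. This covers only the indexing of prediction targets; the encoder, the context network, and the altered distractor versions are outside the sketch, and the `future_targets` helper is hypothetical.

```python
def future_targets(num_frames, max_step):
    """List (context_position, step, target_position) triples.

    For step k, the context ending at position t must identify the
    encoded frame at position t + k. Larger k means a harder task.
    """
    triples = []
    for k in range(1, max_step + 1):
        for t in range(num_frames - k):
            triples.append((t, k, t + k))
    return triples

targets = future_targets(num_frames=4, max_step=2)
print(targets)
```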

After using two convolutional neural networks to pretrain a model on raw, unlabeled audio data, the system is optimized to solve a task that becomes increasingly difficult: predicting audio at various time steps, with the arrows indicating predictions that are further in the future.

Once this pretrained, self-supervised model has developed a strong understanding of speech, we use a small amount of supervised data, 80 hours of transcribed audio, to train the final speech recognition system. Our system uses 150x less labeled data than Deep Speech 2, the previous best comparable system, while reducing the word error rate by 9 percent. This work enables us to quickly expand our speech recognition capabilities to many more languages, without needing large quantities of transcribed speech in each one.

Both of these approaches focus on speech and language understanding, but they also represent a more foundational shift in how we’re exploring and even combining different degrees of data supervision. That includes leveraging large amounts of unlabeled training data, as well as using small amounts of labeled data to unlock the vast potential of self-supervised systems. And of all the AI-related tasks that an increased emphasis on self-supervision can accelerate, none is as important as improving the safety of the people who use our products.

Written by

Michael Auli

Research Scientist, Facebook AI

Matt Feiszli

Research Scientist, Facebook AI

Deepti Ghadiyaram

Research Scientist, Facebook AI

Alex Kirillov

Research Scientist, Facebook AI

Holger Schwenk

Research Scientist, Facebook AI

Ves Stoyanov

Applied Researcher Manager, Facebook AI

Du Tran

Research Scientist, Facebook AI

Manohar Paluri

Director, Facebook AI