AI advances to better detect hate speech

May 12, 2020

We have a responsibility to keep the people on our platforms safe, and dealing with hate speech is one of the most complex and important components of this work. To better protect people, we have built AI tools that can quickly, and often proactively, detect this content. As noted in the Community Standards Enforcement Report released today, AI now proactively detects 88.8 percent of the hate speech content we remove, up from 80.2 percent the previous quarter. In the first quarter of 2020, we took action on 9.6 million pieces of content for violating our hate speech policies, an increase of 3.9 million over the previous quarter.

This progress is due in large part to our recent AI advances in two key areas:

  • Developing a deeper semantic understanding of language, so our systems detect more subtle and complex meanings.

  • Broadening how our tools understand content, so that our systems look at the image, text, comments, and other elements holistically.

To deepen semantic understanding of language, we’ve recently deployed new technologies such as XLM, Facebook AI’s method for self-supervised pretraining across multiple languages. We are now working to advance these systems further with new models such as XLM-R, which incorporates RoBERTa, Facebook AI’s state-of-the-art self-supervised pretraining method.

To broaden how our tools understand content, we have also built a pre-trained universal representation of content for integrity problems. This whole entity understanding system is now used at scale to analyze content to help determine whether it contains hate speech. More recently, we have further improved the whole entity understanding system by using post-level, self-supervised learning.
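
As a rough, hypothetical sketch of the idea (not our production architecture), the snippet below fuses a text embedding and an image embedding into a single post-level vector in PyTorch; the module name, dimensions, and fusion choice are placeholders.

```python
# Hypothetical sketch of a "whole post" representation: project text and
# image embeddings into a shared space and fuse them into one vector that
# downstream integrity classifiers can share. Names and sizes are placeholders.
import torch
import torch.nn as nn

class WholePostEncoder(nn.Module):
    def __init__(self, text_dim=768, image_dim=2048, fused_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, fused_dim)
        self.image_proj = nn.Linear(image_dim, fused_dim)
        self.fusion = nn.TransformerEncoderLayer(
            d_model=fused_dim, nhead=8, batch_first=True
        )

    def forward(self, text_emb, image_emb):
        # Treat each modality as one "token" and let self-attention mix them.
        tokens = torch.stack(
            [self.text_proj(text_emb), self.image_proj(image_emb)], dim=1
        )
        fused = self.fusion(tokens)   # (batch, 2, fused_dim)
        return fused.mean(dim=1)      # one vector per post

encoder = WholePostEncoder()
post_vector = encoder(torch.randn(4, 768), torch.randn(4, 2048))
print(post_vector.shape)  # torch.Size([4, 512])
```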

These challenges are far from solved, and our systems will never be perfect. But by breaking new ground in research, we hope to make further progress in using AI to detect hate speech, remove it quickly, and keep people safe on our platforms.


We’ve expanded both the breadth and depth of our content understanding systems, which has helped us improve detection of hate speech and other harmful content.

The challenge of teaching machines to recognize hate speech effectively

Facebook has established clear rules on what constitutes hate speech, but it is challenging to detect hate speech in all its forms; across hundreds of languages, regions, and countries; and in cases where people are deliberately trying to avoid being caught. Context and subtle distinctions of language are key. Idioms and nuances can vary widely across cultures, languages, and regions. Even expert human reviewers can sometimes struggle to distinguish a cruel remark from something that falls under the definition of hate speech or miss an idiom that isn’t widely used. What’s more, hate speech can come in any language and virtually any medium, such as a post, an image, a photo caption, or a video. We have found that a substantial percentage of hate speech on Facebook globally occurs in photos or videos. As with other content, hate speech also can be multimodal: A meme might use text and image together to attack a particular group of people, for example.

This example illustrates how hate speech can be multimodal. The text alone is ambiguous. But when it’s combined with the image, the statement takes on another meaning.

What’s more, people sharing hate speech often try to elude detection by modifying their content. This sort of adversarial behavior ranges from intentionally misspelling words or avoiding certain phrases to modifying images and videos.

As we improve our systems to address these challenges, it’s crucial to get it right. Mistakenly classifying content as hate speech can mean preventing people from expressing themselves and engaging with others. Counterspeech — a response to hate speech that may include the same offensive terms — is particularly challenging to classify correctly because it can look so similar to the hate speech itself.

Finally, the relative scarcity of examples of these violations poses an additional challenge for training our tools. To build models that understand linguistic and cultural nuances across the many languages on our platform, we need to be able to understand and learn from not only the limited set of violating content but also from the billions of examples of non-violating content on our platform.

Progress in using AI to detect hate speech

Over the last several years, we’ve invested in building proactive detection tools for hate speech, so we can remove this content before people report it to us, and in some cases before anyone even sees it. Our detection techniques include text and image matching, which means we identify images and strings of text that are identical to content that’s already been removed as hate speech. We also use machine-learning classifiers that look at things like the text in a post, as well as the reactions and comments, to assess how closely the post matches common phrases, patterns, and attacks.
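
The matching step can be pictured with a minimal sketch like the one below, which fingerprints normalized text and checks it against fingerprints of previously removed content. The normalization and hashing used in production are far more robust; the function names here are our own placeholders.

```python
# Minimal sketch of text matching: hash a normalized form of the text and
# compare it against hashes of content already removed as hate speech.
# Production matching is far more robust; this only illustrates the idea.
import hashlib

def text_fingerprint(text: str) -> str:
    normalized = " ".join(text.lower().split())   # crude normalization
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Fingerprints of content that was previously removed (placeholder entry).
known_violation_hashes = {text_fingerprint("previously removed hateful phrase")}

def matches_known_violation(text: str) -> bool:
    return text_fingerprint(text) in known_violation_hashes

print(matches_known_violation("Previously  removed hateful PHRASE"))  # True
```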

When we first deployed these systems to proactively detect potential hate speech violations, we relied on our content review teams to decide whether to take action. But by last spring, our systems were accurate enough to be used to remove posts automatically in some limited cases.

We’ve continued to improve our contextual classifiers for hate speech by incorporating a new bi-transformer text model and whole entity understanding as features and by using bias sampling to capture more false negatives, among other recent updates.
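
Bias sampling can take many forms; the sketch below shows one plausible version (not necessarily the one we use), in which posts scored just below the enforcement threshold are oversampled for human labeling so that more false negatives are found. The threshold, band, and weights are illustrative.

```python
# One plausible form of bias sampling: oversample posts whose classifier
# score falls just under the action threshold, since false negatives are
# likely to hide there. All numbers here are illustrative, not production values.
import random

def sample_for_labeling(scored_posts, threshold=0.8, band=0.3, k=100):
    """scored_posts: list of (post_id, hate_speech_score) pairs."""
    weights = [
        3.0 if threshold - band <= score < threshold else 1.0
        for _, score in scored_posts
    ]
    return random.choices(scored_posts, weights=weights, k=k)

posts = [(i, random.random()) for i in range(10_000)]
review_batch = sample_for_labeling(posts)
print(len(review_batch))  # 100 posts, biased toward near-threshold scores
```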

We created our whole entity understanding system as a single, generalized pretrained representation of content that can help address many different integrity problems. It works by modeling content across modalities, violation types, and even time, so it can provide a more holistic understanding of a particular post or comment. Our latest version is trained on more violations. The system improves performance across modalities by using focal loss, which prevents easy-to-classify examples from overwhelming the detector during training, along with gradient blending, which computes an optimal blend of modalities based on their overfitting behavior.
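
Focal loss itself is a published technique (Lin et al., 2017); a minimal binary version in PyTorch looks like the sketch below, with illustrative hyperparameters.

```python
# Minimal binary focal loss (Lin et al., 2017): scale the cross-entropy by
# (1 - p_t)^gamma so well-classified examples contribute little, keeping
# easy examples from overwhelming training. Hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()

logits = torch.tensor([2.5, -1.0, 0.3])
targets = torch.tensor([1.0, 0.0, 1.0])
print(float(focal_loss(logits, targets)))
```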

Pushing the state of the art with XLM-R

Earlier this year, we published our work on a new model called XLM-R, which uses self-supervised training techniques to achieve state-of-the-art performance in understanding text across multiple languages. We are working toward using XLM-R to help human reviewers analyze potential hate speech.

Our XLM-R model builds on one of the most important recent advances in self-supervised learning, the revolutionary Bidirectional Encoder Representations from Transformers (BERT) technique. BERT models are trained by taking sentences, blanking out words at random, and having the model learn to predict the most likely words to fill in the blanks. This process is known as self-supervised pretraining.

This graphic shows how our model is first pretrained by learning to predict blanked-out words.
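
This masked-word objective can be tried directly with the publicly released XLM-R checkpoint through the Hugging Face transformers library; the snippet below is purely an illustration of the pretraining task, not our training pipeline.

```python
# Illustration of masked-word prediction with the public xlm-roberta-base
# checkpoint via Hugging Face's fill-mask pipeline. This demonstrates the
# pretraining task only; it is not the production training setup.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="xlm-roberta-base")
for prediction in fill_mask("The cat sat on the <mask>."):
    print(prediction["token_str"], round(prediction["score"], 3))
```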

XLM-R advances this method in two important ways:

  • We’ve developed a new pretraining recipe, RoBERTa, to train efficiently on orders of magnitude more data and for longer.

  • We also have created NLP models that improve performance by learning across multiple languages. This method, called XLM, uses a single, shared encoder trained on large amounts of multilingual data, generating sentence embeddings that work across a range of languages and transferring knowledge effectively between them.

XLM-R incorporates the strengths of both XLM and RoBERTa to achieve the best results to date on four cross-lingual understanding benchmarks and outperform traditional monolingual baselines under some conditions.

Once the model is pretrained, we fine-tune it using a much smaller amount of labeled data for the specific task.
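
In the open-source ecosystem, this fine-tuning step looks roughly like the sketch below, which attaches a two-class head to the public xlm-roberta-base checkpoint; the texts, labels, and hyperparameters are placeholders.

```python
# Rough sketch of fine-tuning: add a classification head to the pretrained
# encoder and train on a small labeled dataset. Data and hyperparameters
# below are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

texts = ["placeholder post one", "placeholder post two"]
labels = torch.tensor([0, 1])   # 0 = non-violating, 1 = violating

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
optimizer.zero_grad()
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))
```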

Since XLM-R is primarily trained in a self-supervised way, we are able to train on large amounts of unlabeled data for languages for which it is hard to build a labeled dataset. Further, our research on cross-lingual models has also revealed language-universal structures, in which text with the same meaning in different languages is represented similarly internally by the model. This allows models like XLM-R to learn in a language-agnostic fashion, taking advantage of transfer learning to learn from data in one language (e.g., Hindi) and use it in other languages (e.g., Spanish and Bulgarian).

This graphic illustrates how hate speech in different languages is represented in a single, shared embedding space.
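
The shared embedding space can be probed with the public checkpoint as well: mean-pool XLM-R’s token embeddings for the same sentence in two languages and compare the vectors with cosine similarity. Mean pooling is simply a common choice for this kind of probe, not necessarily what our systems use.

```python
# Probe of the shared multilingual space: embed a sentence and its Spanish
# translation with the public xlm-roberta-base checkpoint and compare them.
# Mean pooling over tokens is a simple illustrative choice.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

def embed(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, hidden)
    return hidden.mean(dim=1).squeeze(0)             # mean-pooled sentence vector

english = embed("I really dislike this movie.")
spanish = embed("Realmente no me gusta esta película.")
print(torch.cosine_similarity(english, spanish, dim=0).item())
```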

An additional advantage is that a single multilingual classifier can produce predictions for text in multiple languages, which simplifies the process of shipping and maintaining classifiers and allows us to iterate and improve more quickly.

Self-supervised models that understand content across languages, modalities, and tasks

To help reduce the prevalence of hate speech and scale to other areas, we need to invest further in deep semantic understanding of content with multimodal learning. One promising direction we are exploring is post-level self-supervision, which combines the benefits of whole post understanding with state-of-the-art self-supervised and weakly supervised pretraining techniques. By devising tasks and pretraining objectives tailored to the unique characteristics of posts, we can achieve deeper post-level semantic understanding. This allows us to learn from many more examples, of both hate speech and benign content, unlocking the value of unlabeled data.
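
One plausible formulation of a post-level self-supervised objective, offered here only as a sketch and not as the specific objective we use, is a contrastive loss that pushes the text and image of the same post toward each other in embedding space:

```python
# Sketch of a contrastive (InfoNCE-style) post-level objective: embeddings
# of a post's text and its image should match each other rather than those
# of other posts in the batch. This is one plausible formulation only.
import torch
import torch.nn.functional as F

def contrastive_post_loss(text_emb, image_emb, temperature=0.07):
    # text_emb, image_emb: (batch, dim) embeddings of the same batch of posts.
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature     # pairwise similarities
    targets = torch.arange(text_emb.size(0))            # true pairs on the diagonal
    return F.cross_entropy(logits, targets)

loss = contrastive_post_loss(torch.randn(8, 512), torch.randn(8, 512))
print(float(loss))
```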

To advance content understanding and keep our platform safe, we are relying more and more on systems trained across multiple modalities using large amounts of unlabeled data.

By using transfer learning and cross-lingual models, we’ll also be able to carry performance gains in one language or on one task over to many others. This approach also allows us to respond quickly to emerging problems by building on top of the self-supervised models.

We’re also launching new open initiatives like the Hateful Memes Challenge and accompanying dataset. These efforts will spur the broader AI research community to test new methods, compare their work, and benchmark their results in order to accelerate work on detecting multimodal hate speech. As with other open benchmarks and datasets, we believe we’ll all make faster progress collectively by comparing our techniques and results with those of others.

While AI isn’t the only answer to the challenge of hate speech and other harmful content, we are encouraged by the progress we’ve made and eager to do more.

Written By

Ryan Dansby

Data Scientist

Han Fang

Data Scientist

Hao Ma

Research Scientist Manager

Chris Moghbel

Product Manager

Umut Ozertem

Research Scientist Manager

Xiaochang Peng

Research Scientist

Ves Stoyanov

Applied Research Manager

Sinong Wang

Research Scientist

Fan Yang

Research Scientist

Kevin Zhang

Software Engineer