ML Applications

Integrity

Training AI to detect hate speech in the real world

November 19, 2020

AI systems are only as good as the data they are trained with. If they don’t have the right mix of examples to learn from, even sophisticated models will struggle. This is especially true and important when building tools to deal with hate speech. Hate speech varies from country to country or even group to group — and it can evolve rapidly, drawing on current events or trends. People also try to disguise hate speech with sarcasm and slang, intentional misspellings, and sophisticated photo alterations. And effective tools have to not just find problems but also avoid mistakes. A false positive can prevent someone from communicating with friends or others in their community.

To address these challenges and better protect people from content that incites violence or violates our policy on hate speech, we’ve built and deployed an innovative system called Reinforcement Integrity Optimizer (RIO). RIO is an end-to-end optimized reinforcement learning (RL) framework. It’s now used to optimize hate speech classifiers that automatically review all content uploaded to Facebook and Instagram.

AI classification systems are typically trained offline. Engineers choose a fixed dataset and use it to teach their model. They then deploy it to production. But often what works well in the controlled test environment isn’t the best choice for the real world.

RIO takes a new approach. It guides the model to learn directly from millions of current pieces of content. And it uses online metrics as reward signals to optimize AI models across all aspects of development: data, features, architecture, and parameters. It constantly evaluates how well it’s doing its job, and it learns and adapts to make our platforms safer over time.
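The post doesn’t publish RIO’s internals, but the core idea of turning an online metric into a reward that reshapes how training data is chosen can be sketched in a few lines. Everything below (the strategy count, the exposure metric, the REINFORCE-style update) is an illustrative assumption, not Facebook’s production code.

```python
import numpy as np

# Illustrative sketch only: an online exposure metric becomes a reward that
# nudges a softmax policy over candidate data-selection strategies.

class SamplingPolicy:
    """Softmax weights over a small set of data-selection strategies."""

    def __init__(self, num_strategies, lr=0.1):
        self.logits = np.zeros(num_strategies)
        self.lr = lr

    def probs(self):
        e = np.exp(self.logits - self.logits.max())
        return e / e.sum()

    def sample(self, rng):
        return rng.choice(len(self.logits), p=self.probs())

    def update(self, strategy, reward, baseline):
        # REINFORCE-style update: strategies whose sampled training data led
        # to better online outcomes become more likely to be chosen again.
        grad = -self.probs()
        grad[strategy] += 1.0
        self.logits += self.lr * (reward - baseline) * grad


def online_reward(exposure_before, exposure_after):
    # Reward = reduction in how often people were exposed to violating content.
    return exposure_before - exposure_after


rng = np.random.default_rng(0)
policy = SamplingPolicy(num_strategies=4)
baseline = 0.0
for step in range(100):
    strategy = policy.sample(rng)
    # In a real system, the chosen strategy would select fresh training data,
    # the classifier would be retrained, and the exposure metric re-measured.
    # Here the "after" exposure is a random placeholder.
    reward = online_reward(exposure_before=1.0, exposure_after=rng.random())
    baseline = 0.9 * baseline + 0.1 * reward
    policy.update(strategy, reward, baseline)
```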

This animation shows RIO’s end-to-end optimization.

We deployed RIO at the end of Q3 2020, so our next Community Standards Enforcement Report will help us assess its impact. (The most recent report is available here.)

The downsides of a fragmented system to train and deploy models

In typical AI-powered integrity systems, prediction and enforcement are two separate steps. An AI model predicts whether something is hate speech or an incitement to violence, and then a separate system determines whether to take an action, such as deleting it, demoting it, or sending it for review by a human expert. While content integrity teams assess the overall effectiveness of the system, engineers building the classification models focus only on improving prediction. In other words, the model is not enforcement-aware.
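To make that separation concrete, here is a deliberately simplified sketch of such a fragmented pipeline. The function names and thresholds are hypothetical, not production code: the classifier is trained and evaluated only on how well it scores content, while a separate enforcement layer turns scores into actions.

```python
# Simplified sketch of the fragmented setup described above.

def predict_hate_speech_score(post_text: str) -> float:
    """Stand-in for an offline-trained classifier that returns a score in [0, 1].

    The model is trained and evaluated purely on prediction accuracy; it never
    sees what the enforcement layer below does with its scores.
    """
    return 0.0  # placeholder for a real model's output


def enforce(score: float) -> str:
    # Enforcement is a separate system with its own thresholds; the classifier
    # above is not optimized against these downstream decisions.
    if score > 0.95:
        return "remove"
    if score > 0.80:
        return "send_to_human_review"
    if score > 0.50:
        return "demote"
    return "no_action"


action = enforce(predict_hate_speech_score("example post"))
```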

This approach has several significant drawbacks. Better predictions on their own may not improve how well the system protects people from harmful content in the real world. For example, a system might be good at catching hate speech that reaches only very few people but fail to catch other content that is more widely distributed. Moreover, what works well in a controlled experiment with limited data may fare much worse in production. Limited training data can lead to overfitting, where the model performs well on its training dataset but not when given new, unfamiliar content to classify.

Finally, the relative scarcity of examples of hate speech and other harmful content poses an additional challenge for training. We need to be able to learn from not only the limited set of violating content but also the billions of examples of nonviolating content on our platform.
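One common remedy for this kind of imbalance, shown below purely as an illustration (the post doesn’t say this is what RIO’s sampler does), is to oversample the rare violating examples in each batch and attach importance weights so the training signal still reflects the true class distribution.

```python
import numpy as np

def make_balanced_batch(pos_indices, neg_indices, batch_size, pos_fraction, rng):
    """Draw a batch with a fixed fraction of (rare) violating examples."""
    n_pos = int(batch_size * pos_fraction)
    n_neg = batch_size - n_pos
    pos = rng.choice(pos_indices, size=n_pos, replace=True)
    neg = rng.choice(neg_indices, size=n_neg, replace=False)

    # Importance weights undo the oversampling so gradients stay unbiased.
    true_pos_rate = len(pos_indices) / (len(pos_indices) + len(neg_indices))
    w_pos = true_pos_rate / pos_fraction
    w_neg = (1.0 - true_pos_rate) / (1.0 - pos_fraction)

    indices = np.concatenate([pos, neg])
    weights = np.concatenate([np.full(n_pos, w_pos), np.full(n_neg, w_neg)])
    return indices, weights


rng = np.random.default_rng(0)
violating = np.arange(100)          # toy setup: 100 violating examples
benign = np.arange(100, 100_000)    # toy setup: ~100k benign examples
batch, weights = make_balanced_batch(violating, benign, 256, 0.25, rng)
```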

How RIO unifies this process

A better approach is to take the bottom-line results — how well the system did in protecting people from seeing hate speech — and use them to directly optimize the AI model end to end. This is exactly what RIO does.

The framework’s data sampler estimates the value of the training examples, deciding which ones will produce the most effective hate speech classifier. We are working to deploy additional RIO components: a model optimizer that lets engineers write a customized search space of parameters and features; a deep reinforced controller that generates candidate data sampling policies, features, and model architectures/hyperparameters; and an enforcement and ranking system simulator that provides the right online signals for each candidate generated by the controller.
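Put together, those components form a search loop: the controller proposes a candidate configuration, the simulator scores it with online-style signals, and the results guide the next round. The sketch below follows that structure but is only an illustration; the search space, the random-search stand-in for the controller, and the placeholder reward are all assumptions, not RIO’s actual implementation.

```python
import random

random.seed(0)

SEARCH_SPACE = {  # a customized search space, as written via the model optimizer
    "sampling_policy": ["uniform", "recency_weighted", "hard_negative"],
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "num_layers": [2, 4, 8],
}


def controller_propose():
    """Deep reinforced controller, reduced here to random search for brevity."""
    return {name: random.choice(options) for name, options in SEARCH_SPACE.items()}


def simulate_enforcement(candidate):
    """Enforcement/ranking-system simulator: returns an online-style reward,
    e.g., an estimate of how much less violating content would be seen."""
    return random.random()  # placeholder signal


best_candidate, best_reward = None, float("-inf")
for _ in range(20):
    candidate = controller_propose()
    reward = simulate_enforcement(candidate)
    if reward > best_reward:
        best_candidate, best_reward = candidate, reward
```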

Expanding RIO to tackle other types of harmful content

Preventing the spread of hate speech requires more than just advanced AI model architectures and scores of GPUs to crunch the numbers. Even the most powerful system can’t be effective if it’s not learning from the right examples.

With RIO, we don’t just get a better sampling of training data. Our system can focus directly on the bottom-line goal of protecting people from seeing this content. Moreover, we can pair it with innovative self-supervised learning models like Linformer. That combination of better optimization and more efficient models is already making a difference in our production systems.
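For context, Linformer’s contribution is reducing the quadratic cost of self-attention by projecting keys and values down to a fixed length, which is what makes it practical to run large text models over huge volumes of content. The rough NumPy sketch below illustrates that published attention trick, not the production model.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def linformer_attention(Q, K, V, E, F):
    """Single-head Linformer-style attention for one sequence.

    Keys and values (length n) are projected down to a fixed length k,
    so the attention map is n x k instead of n x n.
    """
    d = Q.shape[-1]
    K_proj = E @ K                          # (k, d) instead of (n, d)
    V_proj = F @ V                          # (k, d)
    scores = Q @ K_proj.T / np.sqrt(d)      # (n, k) attention map
    return softmax(scores) @ V_proj         # (n, d) output


n, d, k = 1024, 64, 128
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E, F = (rng.standard_normal((k, n)) / np.sqrt(n) for _ in range(2))
out = linformer_attention(Q, K, V, E, F)    # shape (n, d)
```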

We hope to expand RIO to tackle other kinds of harmful content and to use additional online reward signals to guide it. And we’re encouraged to see how reinforcement learning, an approach that until now has excelled primarily in games and research projects, can be applied in a real-world environment where it can make a difference in people’s lives. AI has enabled us to detect more kinds of hate speech violations, more quickly and with greater accuracy. We have much more work to do, but our technology has already improved significantly in a relatively short time, and we continue to look for new ways to use AI to protect the people who use our products.