Open-sourcing ReAgent, a modular, end-to-end platform for building reasoning systems

10/16/2019

Whether they’re designed to surface product recommendations or navigate busy highways, reasoning systems for real-world decision-making require some of the most sophisticated policies in machine learning. But despite advances in reinforcement learning (RL) and other reward-based approaches, learning through trial and error is difficult in unpredictable environments, and developing policies that can achieve complex objectives is often time- and resource-intensive. To overcome challenges like these, we are introducing ReAgent, a full suite of tools designed to streamline the process of building models that make and rely on decisions.

ReAgent is composed of these three resources:

Models that generate decisions and receive feedback on those decisions.
An offline evaluator module that estimates how new models will perform before they are deployed in production.
A serving platform for deploying models at scale, collecting the most useful feedback, and iterating quickly.

It’s the most comprehensive and modular open source platform for creating AI-based reasoning systems, and it’s the first to include policy evaluation that incorporates offline feedback to improve models. By making it easier to build models that make decisions in real time and at scale, ReAgent democratizes both the creation and evaluation of policies in research projects as well as production applications.

The toolkit, whose name comes from a portmanteau of “reasoning” and “agents,” is currently being used at Facebook to drive tens of billions of decisions per day. It is part of our broader efforts to advance the state of the art in RL — which have included enabling robots to teach themselves how to move and releasing an open source Go-playing bot that uses deep RL to beat professional human players. ReAgent’s use of feedback to improve models after they’ve been deployed makes RL more viable for large-scale applications.

Lessons learned in deploying reasoning systems at scale

ReAgent expands on Horizon, the first open source, end-to-end RL platform designed to optimize systems in large-scale production environments. Horizon is now part of ReAgent, along with additional resources to help researchers and engineers optimize or evaluate almost any decision-based model, whether it uses RL or other approaches.

One of the most important lessons we learned from using Horizon over the past year was that, though it provided resources for training production-ready RL models, and then improving them during deployment, the platform’s core library was more applicable to in-development models than existing ones. So we built ReAgent as a small C++ library that can be embedded in any application. And where Horizon focused on models that already had enough data to start improving through trial and error, ReAgent better addresses the difficulty of creating new systems, with included models that assist in beginning the process of gathering relevant learning data.

Previously, the open source RL model libraries available to researchers typically required them to fill in substantial gaps before evaluating and deploying their systems. Researchers and engineers have also had access to various decision services that offer extensive functionality, but these work more as services than toolkits, making their resources difficult to integrate into existing models or projects. ReAgent strikes a balance between these offerings, with modular tools that can be used for complete, end-to-end development of a new model, or to evaluate or optimize existing systems, avoiding the need to repeat substantial work with each new evaluation or implementation.

Turning actions into feedback, and feedback into training data

Each of ReAgent’s three primary resources (models, an evaluator, and a serving platform) can be used independently. The three components are most effective, however, when used in concert. Together they enable researchers to develop and evaluate models that operate in a virtuous cycle of user requests, model-generated actions, and feedback that improves those models’ future actions.

When someone interacts with a ReAgent-enabled decision model, the serving platform sends a list of potential actions to training models, whose scores help rank an immediate action — such as recommending a product. ReAgent can also send suggested model changes to an offline evaluator, which tests updated decision plans before they’re incorporated into the deployed model.

ReAgent works by turning user inputs into training resources. When someone interacts with the system, such as clicking on a recommended project or suggested link, the ReAgent serving platform (RSP) responds by generating scores that rank a given set of related actions, according to preexisting decision plans. These plans are based on a collection of parameters determined by the models provided as part of ReAgent. This approach addresses the cold-start problem, where it is difficult or impossible to personalize actions without already having sufficient feedback — such as how often people respond to a given recommendation. ReAgent’s open source models aren’t a full substitute for such interaction, but they help make decision-based systems function well enough at launch to begin gathering relevant feedback.
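As a rough illustration of how a decision plan can rank actions before much feedback exists, the sketch below blends a plan-supplied prior score with whatever feedback has been observed so far. This is not ReAgent's API: the names `smoothed_score`, `rank_actions`, `plan`, and `feedback`, and the pseudo-count smoothing formula itself, are assumptions chosen only to make the cold-start idea concrete.

```python
def smoothed_score(prior_score, prior_weight, successes, trials):
    """Blend a plan's prior score with observed feedback.

    prior_weight acts as a pseudo-count and must be positive, so the
    score is defined even when no feedback has arrived (trials == 0).
    """
    return (prior_score * prior_weight + successes) / (prior_weight + trials)


def rank_actions(plan, feedback):
    """Rank actions from highest to lowest blended score.

    plan:     action -> (prior_score, prior_weight)   # hypothetical format
    feedback: action -> (successes, trials)           # hypothetical format
    """
    scored = []
    for action, (prior_score, prior_weight) in plan.items():
        successes, trials = feedback.get(action, (0, 0))
        score = smoothed_score(prior_score, prior_weight, successes, trials)
        scored.append((score, action))
    scored.sort(reverse=True)
    return [action for _, action in scored]


# Hypothetical plan for three suggested links.
plan = {
    "link_a": (0.30, 10),
    "link_b": (0.20, 10),
    "link_c": (0.10, 10),
}

cold_start_ranking = rank_actions(plan, {})
after_feedback_ranking = rank_actions(plan, {"link_c": (40, 50)})
```

With no feedback, the ranking simply follows the plan's priors. Once `link_c` has collected 40 successes in 50 trials, its blended score overtakes the others and it moves to the top, which is the kind of shift that gathering feedback after launch is meant to enable.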

But ReAgent’s most powerful asset is its ability to turn feedback into rewards, either to modify the online training models that optimize the RSP’s performance in real time or for offline training, which can lead to more substantial optimizations over longer periods of time. When feedback is sent offline, the evaluator analyzes it, then engineers determine whether to incorporate it into a set of decision configurations. Those configurations then update the RSP’s decision plans, adjusting the scores associated with future actions that it ranks.
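This post does not specify which estimator the offline evaluator uses. One standard way to estimate how a candidate policy would perform from logged feedback alone is inverse propensity scoring (IPS), sketched below as an assumption rather than as ReAgent's implementation. The function name `ips_estimate` and the tuple layout of the log are hypothetical.

```python
def ips_estimate(logged, target_policy):
    """Inverse propensity scoring estimate of a policy's average reward.

    logged: iterable of (context, action, reward, logging_prob) tuples,
        where logging_prob is the probability the deployed policy
        assigned to the action it actually took.
    target_policy: function (context, action) -> probability that the
        candidate policy would take that action.
    """
    total = 0.0
    n = 0
    for context, action, reward, logging_prob in logged:
        if logging_prob <= 0.0:
            raise ValueError("logged actions need a positive propensity")
        # Reweight each logged reward by how much more (or less) likely
        # the candidate policy is to take the logged action.
        weight = target_policy(context, action) / logging_prob
        total += weight * reward
        n += 1
    if n == 0:
        raise ValueError("no logged data")
    return total / n


# Hypothetical log: the deployed policy chose "a" or "b" uniformly.
logged = [
    ("user_1", "a", 0.0, 0.5),
    ("user_1", "b", 1.0, 0.5),
    ("user_2", "a", 0.0, 0.5),
    ("user_2", "b", 1.0, 0.5),
]


def always_b(context, action):
    """Candidate policy that always recommends action "b"."""
    return 1.0 if action == "b" else 0.0


estimate = ips_estimate(logged, always_b)
```

In this toy log every "b" earned a reward of 1 and every "a" earned 0, so the estimate for the always-"b" policy is 1.0, obtained without deploying it. IPS estimates can have high variance when the logging probabilities are small, which is one reason evaluators often combine several estimators.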

In addition to using feedback to improve deep RL models, ReAgent also supports contextual bandits as well as multi-armed bandits, including traditional upper confidence bound models and the more recent perturbed-history exploration bandit we developed alongside researchers at Google Research and DeepMind. This versatility makes the platform applicable to systems that rely on making decisions, even if they aren’t using RL techniques.
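For readers unfamiliar with upper confidence bound methods, here is a minimal sketch of the classic UCB1 rule for a multi-armed bandit. It is a textbook formulation, not ReAgent's code; the class `UCB1Bandit`, the helper `simulate`, and the click rates are all hypothetical.

```python
import math
import random


class UCB1Bandit:
    """Minimal UCB1 multi-armed bandit (illustrative; not ReAgent's API)."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms    # number of times each arm was pulled
        self.totals = [0.0] * n_arms  # sum of rewards observed per arm

    def select_arm(self):
        # Pull every arm once before applying the confidence bound.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        t = sum(self.counts)
        # Mean reward plus an exploration bonus that shrinks with pulls.
        scores = [
            self.totals[a] / self.counts[a]
            + math.sqrt(2.0 * math.log(t) / self.counts[a])
            for a in range(len(self.counts))
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.totals[arm] += reward


def simulate(click_rates, steps=5000, seed=0):
    """Run UCB1 against simulated Bernoulli rewards; return pull counts."""
    rng = random.Random(seed)
    bandit = UCB1Bandit(len(click_rates))
    for _ in range(steps):
        arm = bandit.select_arm()
        reward = 1.0 if rng.random() < click_rates[arm] else 0.0
        bandit.update(arm, reward)
    return bandit.counts


counts = simulate([0.1, 0.5, 0.9])
```

The exploration bonus is large for rarely tried arms and shrinks as evidence accumulates, so the bandit keeps sampling weaker options occasionally while concentrating most pulls on the arm with the highest observed reward.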

Using modular design to optimize reasoning at every stage in a model’s development cycle

ReAgent’s modular design makes it useful throughout the typical life cycle of models that rely on decisions, with each phase related to the data that’s available at the time. Typically, such models start with little to no data to work with and are therefore rules-based systems, with handwritten logic that helps gather relevant feedback on which to train them. Once the model is deployed, and some amount of data has been collected, it can be improved using a multi-armed bandit. For a global model, which doesn’t require additional personalization, developers can stop at this stage, having either finished or improved their model using ReAgent’s resources. To proceed to a personalized model, though, engineers would deploy the model again, so it could gather more contextual feedback, and use ReAgent’s contextual bandit or RL model to continue optimizing the system in real time.

This chart shows a typical workflow for creating decision models with ReAgent's end-to-end resources. At first, systems with no data must make decisions using handwritten rules, which can be updated across many machines in real time using ReAgent’s serving platform. With feedback, multi-armed bandits, which share a parameter store, can adapt to make better decisions over time. Finally, contextual feedback enables ReAgent to train contextual bandits and RL models, and deploy them using the TorchScript library in PyTorch, to produce more personalized actions.
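To make the step from a global model to a personalized one concrete, the sketch below keeps separate reward statistics per context, so different groups of people can end up with different preferred actions. It uses simple per-context epsilon-greedy exploration as a stand-in; it is not ReAgent's contextual bandit, and the class name, contexts, and sticker names are hypothetical.

```python
import random
from collections import defaultdict


class ContextualEpsilonGreedy:
    """Per-context epsilon-greedy bandit (illustrative; not ReAgent's API)."""

    def __init__(self, actions, epsilon=0.2, seed=0):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # (context, action) -> [pull count, reward sum]
        self.stats = defaultdict(lambda: [0, 0.0])

    def _mean(self, context, action):
        count, total = self.stats[(context, action)]
        return total / count if count else 0.0

    def best_action(self, context):
        # Highest observed mean reward for this context (first wins ties).
        return max(self.actions, key=lambda a: self._mean(context, a))

    def choose(self, context):
        # Explore with probability epsilon, otherwise exploit.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return self.best_action(context)

    def update(self, context, action, reward):
        entry = self.stats[(context, action)]
        entry[0] += 1
        entry[1] += reward


# Hypothetical preferences used only to simulate feedback.
preferred = {"sports_fan": "ball_sticker", "cat_owner": "cat_sticker"}

bandit = ContextualEpsilonGreedy(["ball_sticker", "cat_sticker"])
for step in range(1000):
    context = "sports_fan" if step % 2 == 0 else "cat_owner"
    action = bandit.choose(context)
    reward = 1.0 if action == preferred[context] else 0.0
    bandit.update(context, action, reward)

learned = {context: bandit.best_action(context) for context in preferred}
```

A context-free bandit trained on the same feedback would have to settle on a single sticker for everyone. Conditioning on context lets each group converge on its own best action.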

Without ReAgent’s modules and overall end-to-end approach, progressing through each of these stages would often require setting up multiple projects. But to make our platform applicable across the entire sequence of creating and using decision models, we had to address the fact that the serving systems at Facebook and other organizations are written in a variety of programming languages (C++, PHP, Python, etc.) and often use different logging services. Individually accommodating all of those languages and services could have required years of work. So when we started building our platform earlier this year, we decided to keep ReAgent’s footprint small. Rather than trying to unify those disparate systems, our streamlined library allows teams to plug in their own logging and model serving services, while we focused on delivering cutting-edge, feedback-based policy optimization.

Within Facebook, ReAgent is already being used at scale, deploying RL models to help ensure the timeliness and relevance of Facebook and Instagram notifications. The platform also makes tens of billions of decisions per day to provide more personalized experiences on News Feed, including increasing the diversity of sticker suggestions, while maintaining the rate of meaningful interactions. For example, rather than surfacing the same selection of popular stickers for everyone to use — a common problem for systems that respond only to rewards — models using ReAgent tend to present a broader range of suggested stickers. ReAgent’s ability to rank actions based on feedback makes its recommendations more personalized over time, potentially leading to suggestions for stickers whose use is uncommon in general but relevant for a specific person.

This ability to deliver accuracy as well as diversity is one of ReAgent’s most important benefits. Without it, the suggestions presented to someone could become narrow and devoid of exploration, such as product recommendations that are effectively identical to someone’s purchase history or a steadily smaller list of suggested stickers. ReAgent helps produce suggestions that are more varied, while still remaining generally relevant. And that same approach to diversity can be applied to research projects, including those that don’t involve recommendations but whose results could benefit from a platform that encourages exploration.

The future of ReAgent and AI-based reasoning systems

Although reasoning systems are in wide use across many industries, the use of RL and other ML-based techniques to move toward more personalized reasoning has been slowed by a lack of universal development tools. ReAgent provides those tools and is part of our continued investment in the open science and resources accelerating the deployment of systems that make more relevant decisions.

Our immediate plans for ReAgent are to continue using ML to improve personalization within Facebook services, replacing handwritten rules or broader reasoning, for example, with more relevant feedback-based predictions for what kinds of content should be surfaced for a given person. We also plan to make what’s now the most accessible end-to-end platform for building and using decision-based systems even more accessible, by adding documentation related to deploying the RSP to cloud services, such as Microsoft Azure. This expanded compatibility will further democratize the use of RL in production and expand the AI community’s understanding of how models can improve their ability to make and assess decisions, including in real time. Such advances could impact everything from e-commerce systems to developer infrastructure, including potentially enabling engineers to build multiple versions of a website and explore the impact of changes before deploying them.

In the long term, however, ReAgent has the potential to become the platform of choice for anyone creating models that make decisions, and evaluating those policies before and during deployment. With its versatility and end-to-end development tools, this platform could catalyze the use of large-scale personalized reasoning for cutting-edge research in a variety of AI fields, while also powering automated reasoning systems that firms use around the world.

To try ReAgent for yourself, check out our interactive tutorial; it provides a step-by-step breakdown of the platform’s benefits, including how ReAgent takes advantage of RL models, multi-armed bandits, and contextual bandits. You can also download the full release here.