Inside Meta's AI optimization platform for engineers across the company

April 21, 2022

AI is an important part of making modern software systems and products work as well as possible, from improving the user experience to making the compute infrastructure more efficient. Whether it’s reducing latency, improving the quality of a video stream, or streamlining the interfaces to match a particular person’s needs, AI today is often more effective than even carefully constructed human-crafted heuristic strategies. But to leverage AI more effectively in our products, we need to address several challenges: the system must accommodate software engineers without backgrounds in machine learning; it must provide mechanisms to optimize for many different product goals, which may differ from closed-form machine learning loss functions; it must distinguish causal connections from correlations in the data; and it must scale efficiently to train, host, and monitor large numbers of AI models.

To address these needs at Meta, we’ve built an end-to-end AI platform called Looper, with easy-to-use APIs for optimization, personalization, and feedback collection. Looper supports the full machine learning lifecycle from model training, deployment, and inference all the way to evaluation and tuning of products. Rather than rebuild our existing products around AI models, Looper enables us to upgrade them to use AI for personalized optimizations. The Looper platform currently hosts 700 AI models and generates 4 million AI outputs per second.

Making smart strategies available to applications

Meta’s different services are used by billions of people every day, each of whom has different interests and preferences. Looper enables us to customize many of these “out of the box” at an unprecedented scale without requiring complex, specialized code.

Overloading someone with dozens of choices in a UI menu can make a product unappealing no matter how much value it offers. But menu preferences vary among different people. Likewise, opportunistically prefetching content to a mobile device when a user is likely to view it may greatly improve the user experience of our product, but doing this without overwhelming the hardware resources of the device requires accurately predicting what will be of most interest.

To support real-time smart strategies in a scalable way, Looper offers several features:

  • Looper targets ease of use and rapid deployment of models for use cases with moderate data sizes and model complexity.

  • It supports a variety of model types, and it hosts and trains numerous models and decision policies.

  • It supports a wide selection of machine learning tasks (classification, estimation, value and sequence prediction, ranking, planning) via its ability to use either supervised or reinforcement learning. Combined with model management infrastructure, our automation tools (AutoML) select models and hyperparameters to balance model quality, size, inference time, etc. Looper covers the scope from data sources to product impact, evaluated and optimized via causal experiments.

  • It is a declarative AI system, which means that product engineers only need to declare the functionality they want and the system fills in the software implementation based on the declaration. Internally, Looper relies on our strategy blueprint abstraction, which combines configurations for features, labels, models, and decision policies into one, and maintains multiple versions of such joint configurations. This supports more comprehensive optimization, captures compatibility between versions, and enables coding-free management of the full lifecycle of smart strategies. Blueprints enable vertical optimizations of black-box product metrics using a powerful experiment optimization system.

  • While other AI platforms often perform inference offline in batch mode, Looper operates in real time.

  • Many AI systems work with uniform data, such as pixels or text, but different products often have very different metadata, often coming from different sources. Moreover, patterns in metadata change quickly, necessitating regular retraining of AI models on fresh data.

  • It uses A/B testing to evaluate many different types of models and decision rules, including those used by contextual bandits, which model uncertainty in predictions across one or more objectives, and by reinforcement learning, which optimizes long-term, cumulative objectives.

  • Unlike traditional end-to-end AI systems, Looper enables engineers and others at Meta to track how a model is actually used in the software stack and experiment on all aspects of the modeling framework – all the way from metric selection to policy optimization. To do this, Looper extends the common definition of end-to-end into the software layer, so that model architecture and feature selection parameters can be optimized in a multiobjective tradeoff between model quality and computational resources. To optimize long-term product goals, an engineer can adjust how much importance is placed on different inputs when making real-time decisions. Our platform makes it possible to optimize these and other parameters using AutoML techniques applied to the entire pipeline.
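The strategy blueprint idea from the list above can be illustrated with a small sketch. This is not Looper's actual schema or API: the class names, field names, and the threshold-based decision policy are all hypothetical, chosen only to show how features, labels, a model choice, and a decision policy might be bundled into one versioned configuration.

```python
from dataclasses import dataclass, replace
from typing import Dict, Tuple


@dataclass(frozen=True)
class StrategyBlueprint:
    """Hypothetical joint configuration for one smart strategy (not Looper's real schema)."""
    version: int
    features: Tuple[str, ...]          # names of input features
    label: str                         # name of the logged outcome used as the label
    model_type: str                    # which interchangeable model to train
    hyperparameters: Tuple[Tuple[str, float], ...]
    decision_threshold: float          # assumed policy: act when score >= threshold


class BlueprintRegistry:
    """Keeps every blueprint version so versions can be compared or rolled back."""

    def __init__(self, initial: StrategyBlueprint) -> None:
        self._versions: Dict[int, StrategyBlueprint] = {initial.version: initial}

    def latest(self) -> StrategyBlueprint:
        return self._versions[max(self._versions)]

    def get(self, version: int) -> StrategyBlueprint:
        return self._versions[version]

    def revise(self, **changes) -> StrategyBlueprint:
        """Store a new version that differs from the latest only in `changes`."""
        current = self.latest()
        new = replace(current, version=current.version + 1, **changes)
        self._versions[new.version] = new
        return new


def decide(blueprint: StrategyBlueprint, score: float) -> bool:
    """Apply the blueprint's decision policy to a model score."""
    return score >= blueprint.decision_threshold


# Illustrative usage with made-up feature and label names.
registry = BlueprintRegistry(
    StrategyBlueprint(
        version=1,
        features=("recent_clicks", "device_type"),
        label="item_viewed",
        model_type="gbdt",
        hyperparameters=(("learning_rate", 0.1),),
        decision_threshold=0.5,
    )
)
v2 = registry.revise(decision_threshold=0.7)
```

Because each revision is stored as a whole, features, model settings, and the decision policy stay mutually consistent within a version, and an earlier version remains available for comparison.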

The Looper platform for deploying smart strategies

Unlike heavyweight AI models for vision, speech and natural language processing, which favor offline inference with batch processing, Looper works with models that can be re-trained and deployed quickly in large numbers on shared infrastructure. Our platform interprets user-interaction and system-interaction metadata as either labels for supervised learning or rewards for reinforcement learning.

Looper pursues fast onboarding, robust deployment, and low-effort maintenance of multiple smart strategies where positive impacts are measured and optimized directly in application terms. Application code is separated from platform code, and Looper leverages existing horizontal AI platforms, such as PyTorch and Ax, with interchangeable models for machine learning tasks.

To make smart strategies successful, we need a way to evaluate them and improve them when results are not sufficiently good. Such evaluation is performed based on product metrics. In some cases, each decision can be checked, so that good and poor decisions can be used as examples on which a smart strategy learns (via supervised learning). However, some product metrics track long-term objectives (such as daily active users) that cannot be traced back to specific decisions. Looper handles both cases, and using live data is particularly important. Access to Meta’s monitoring infrastructure helps detect unforeseen side effects.

On our platform, product developers define the decision space, allowing the platform to automatically select model type and hyperparameter settings. The models are trained and evaluated on live data without user impact, and improved until they can be deployed. Newly trained models are canaried (deployed on shadow traffic) before product use – such models are evaluated on a sampled subset of logged features and observations, and offline quality metrics (e.g., MSE for regression tasks) are computed. This helps avoid degrading model quality when deploying newer models.
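As a rough illustration of the canary step, the sketch below scores a candidate model on logged (features, observation) pairs and computes MSE. The promotion rule used here (promote only if the candidate's MSE is no worse than the production model's, within a tolerance) is an assumption made for the example, as are all function names; the description above only states that offline metrics are computed on a sampled subset of logged data.

```python
from typing import Callable, Sequence, Tuple

Features = Sequence[float]
Model = Callable[[Features], float]
Sample = Tuple[Features, float]  # (logged features, observed outcome)


def mean_squared_error(model: Model, samples: Sequence[Sample]) -> float:
    """Offline MSE of a model on logged samples."""
    if not samples:
        raise ValueError("need at least one logged sample")
    return sum((model(x) - y) ** 2 for x, y in samples) / len(samples)


def passes_canary(
    candidate: Model,
    production: Model,
    samples: Sequence[Sample],
    tolerance: float = 0.0,
) -> bool:
    """Assumed rule: the candidate must not be worse than production by more than `tolerance`."""
    candidate_mse = mean_squared_error(candidate, samples)
    production_mse = mean_squared_error(production, samples)
    return candidate_mse <= production_mse + tolerance


# Toy logged data where the observed outcome is exactly 2 * feature.
logged = [((1.0,), 2.0), ((2.0,), 4.0), ((3.0,), 6.0)]


def production_model(x: Features) -> float:
    return 1.5 * x[0]


def candidate_model(x: Features) -> float:
    return 2.0 * x[0]


promote = passes_canary(candidate_model, production_model, logged)
```

In this toy case the candidate fits the logged observations exactly, so it passes the check; swapping the two models would fail it.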

Adoption and impact of smart strategies

Our vertical machine learning platform hosts moderate-sized models from horizontal platforms so as to improve various aspects of software systems. These models are deployed with little engineering effort and maintained without model-specific infrastructure. Looper is currently used by 90+ product teams at Meta that deploy 690 models that make 4 million predictions per second.

Application use cases fall into five categories, in decreasing order of frequency:

  • Personalized Experience is tailored based on the user's engagement history. For example, a product may display shopping-related content prominently only to those likely to use it (but such content is accessible to all users through menus).

  • Ranking orders items to improve user utility, e.g. to personalize a feed of candidate items for the viewer.

  • Prefetching/precomputing loads or prepares data/resources in advance based on predicted likelihood of usage.

  • Notifications/prompts can be sent only to users who find them helpful.

  • Value estimation covers regression tasks, e.g., predicting the latency or memory usage of a data query.

The figure below compares resource consumption (the number of servers on the y-axis) by resource categories for active Looper use cases.

AI expertise varies across product teams, from beginners to experienced AI engineers, and only 15 percent of teams using the Looper platform include AI engineers. For teams without production AI experience, an easy-to-use AI platform is often the deciding factor for adoption, and AI investment continues upon evidence of utility. Our platform handles concerns about software upgrades, logging, monitoring, etc. behind high-level services and unlocks hefty productivity improvements. For experienced AI engineers, a smart-strategies platform improves productivity by automating repetitive, time-consuming work: writing database queries, implementing data pipelines, and setting up monitoring and alerts. Compared to narrow-focus systems, it helps product developers launch more AI use cases. Regardless of prior AI experience, successful platform adopters configured initial machine learning models in just a couple of days, quickly started collecting training data, and then refined their models and launched new products within just months.

Making AI at scale easier for engineers and product developers

Significant opportunities exist to embed self-optimizing smart strategies for product decisions into software systems, so as to enhance user experience, optimize resource utilization, and support new functionalities. Our AI platform Looper addresses the complexities of product-driven end-to-end machine learning systems and facilitates at-scale deployment of smart strategies. It offers immediate, tangible benefits in terms of data availability, easy configuration, judicious use of available resources, reduced engineering effort, and assured product impact. Platform adopters are particularly attracted by extensive support for product impact evaluation via causal inference and measurements of resource overhead.

Looper makes smart strategies more easily accessible to software engineers and enables product teams to build, deploy and improve AI-driven capabilities in a self-serve fashion without AI expertise. We will continue to develop the platform so that we can leverage AI in new ways to improve Meta’s products and services.

For more technical details, see our paper, “Looper: An end-to-end ML platform for product decisions.”

Written By

Igor Markov

Research Scientist

Norm Zhou

Software Engineering Manager