Research

MultiRay: Optimizing efficiency for large-scale AI models

November 18, 2022

In current AI systems that process text, images, and other modalities, the best possible results are obtained by taking a very large model trained on an immense amount of data and then specializing it on a particular task (for example, identifying harmful speech). This produces an extremely high-quality, extremely expensive one-trick pony. If you have many problems to solve, you’ll need many models, and the cost of operating so many large models will rapidly spiral out of control. This means that in practice, state-of-the-art large models are rarely used in production, and real-world models are often much smaller and simpler.

What if you could compute the expensive part of understanding content with AI once and reuse that result (known as an embedding) across multiple tasks? As part of our push to make our AI systems more efficient, we’ve developed MultiRay, a new platform for running state-of-the-art AI models at scale. MultiRay allows multiple models to run on the same input and share the majority of the processing cost, with each consumer incurring only a small per-model cost. Doing this helps us optimize the total cost of performing these AI tasks. Concentrating company-wide computation into a single model also makes it easier to introduce AI accelerators, and it lets us trade off between compute power and storage at the company level.
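To make the compute-once, reuse-many-times idea concrete, here is a minimal sketch, with hypothetical module names and sizes rather than MultiRay's actual architecture, of one large frozen encoder producing an embedding that several small task-specific heads then consume:

```python
import torch
import torch.nn as nn

EMBED_DIM = 256  # illustrative; production embeddings are much larger

class SharedEncoder(nn.Module):
    """Stand-in for the large, expensive foundational model."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(512, EMBED_DIM), nn.ReLU(), nn.Linear(EMBED_DIM, EMBED_DIM)
        )
    def forward(self, x):
        return self.layers(x)

class TaskHead(nn.Module):
    """Small per-task model that consumes the shared embedding."""
    def __init__(self, num_classes):
        super().__init__()
        self.classifier = nn.Linear(EMBED_DIM, num_classes)
    def forward(self, emb):
        return self.classifier(emb)

encoder = SharedEncoder().eval()
hate_speech_head = TaskHead(num_classes=2)
topic_head = TaskHead(num_classes=50)

features = torch.randn(1, 512)              # stand-in for a featurized post
with torch.no_grad():
    embedding = encoder(features)           # expensive step, computed once

hate_speech_scores = hate_speech_head(embedding)  # cheap per-task steps
topic_scores = topic_head(embedding)              # reuse the same embedding
```

The expensive encoder runs once per input; each additional task only pays for its small head.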

MultiRay’s universal models are trained to perform well across a wide set of tasks and domains. Such a jack-of-all-trades model delivers better quality than the much smaller per-task specialized models we used previously. With MultiRay, teams across Meta can more quickly improve and iterate on machine learning (ML) models for myriad applications, ranging from topic tagging of posts to hate speech detection. These tasks can also be achieved with better efficiency and less human effort than if each team were to build large end-to-end models from scratch.

MultiRay’s first model, TextRay, has been in production since 2020 and supports text understanding applications, such as detecting inauthentic content and improving users’ search experience.

More modalities, more problems

Text is a good start, but the real world is more complex, incorporating many modalities. A Facebook post, for example, might contain text, images, and video. To understand a post, a system needs to analyze each of these elements separately and in context of the others. But doing this means combining several models that are already compute-intensive into a larger, even more intensive model. The resulting increase in compute and power consumption slows down our efforts to bring the most advanced ML models into production for our products and services.

PostRay, MultiRay’s second model, brings together text and image understanding in the same model. Since posts across Facebook and Instagram often contain both text and image data, PostRay reduces the need for teams to build their own text and image understanding. PostRay has several use cases across Meta, including topic classification, which is used for Reels.

PostRay models, because they incorporate cutting-edge research in multiple fields simultaneously, are more complex to train, deploy, and maintain. With MultiRay, we only have to do these tasks a single time, and the whole company reaps the benefits. A centralized system serving a jack-of-all-trades model allows us to work directly with cutting-edge research teams and bring their work to production soon after it is published.

How MultiRay works

MultiRay’s primary aim is to democratize access to large foundational models at Meta. It does so by centralizing execution on accelerators like GPUs and using a cache to save on the cost of recomputation as much as possible. Currently, MultiRay powers over 125 use cases across Meta, and it supports up to 20 million queries per second (QPS) while serving 800 billion queries per day.

What are embeddings?

MultiRay uses large foundational models that represent an input as a point in a high-dimensional vector space. This point is called an embedding, and it is a more ML-friendly version of the original input. Instead of processing the raw input (such as text and images), task-specific models can consume the embedding from MultiRay, which is much simpler to handle. The foundational models deployed in MultiRay are optimized to work for a variety of tasks, including similarity and classification. This universality makes our embeddings quite large (many kilobytes) so that they convey more information.
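As a rough, illustrative calculation (the dimensions here are assumptions, not MultiRay's actual output size), a single high-dimensional float32 embedding already occupies several kilobytes, and downstream tasks consume the vector directly:

```python
import numpy as np

# Hypothetical 2,048-dimensional float32 embedding: 2048 * 4 bytes = 8 KB.
embedding = np.random.rand(2048).astype(np.float32)
print(embedding.nbytes)  # 8192 bytes, i.e., "many kilobytes" per input

# Task-specific models work on the vector rather than the raw text or image.
# For example, a similarity task can compare two embeddings with cosine similarity.
other = np.random.rand(2048).astype(np.float32)
cosine = float(embedding @ other / (np.linalg.norm(embedding) * np.linalg.norm(other)))
print(cosine)
```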

Why centralize large models?

Amortization across many teams

Large models and latency constraints demand execution on accelerators like GPUs. Accelerators (specialized hardware) are in high demand across Meta, and even with them, state-of-the-art models consume a lot of energy to train and host. MultiRay’s client teams split the bill for training and hosting these large models, since the same hardware and processing can be reused many times. The resulting models are much larger and of higher quality than anything each team could have hosted alone. In this case, the whole is greater than the sum of the parts.

Simpler development and operations

Generally, teams across Meta are responsible for their own models, infrastructure, and model upkeep. As models grow in size, training and serving them places an increasing operational burden on each team. It also becomes harder to apply sophisticated optimization techniques to models spread across a variety of teams. MultiRay serves a small number of large centralized models, allowing a single team to handle the majority of the operations and optimization. Client teams own smaller, task-specific models that are easier to manage. This allows many teams that previously didn’t have the bandwidth to train, deploy, and manage cutting-edge AI to use that technology.

Faster research to production: Single-point acceleration

Since MultiRay is a centralized service used by over 125 clients, improvements benefit all the clients. As a result, MultiRay has become a sandbox for our ML and systems specialists to contribute key optimizations that support the broader PyTorch and accelerator ecosystem. MultiRay, for example, was the first large use case to deploy PyTorch’s BetterTransformer in production at Meta. This brought significant capacity savings with no impact on quality.
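As a hedged illustration of this kind of single-point optimization (a generic PyTorch sketch, not MultiRay's actual deployment), recent PyTorch releases can take the fused BetterTransformer fastpath for an eval-mode nn.TransformerEncoder during inference when a padding mask is supplied; the exact conditions depend on the PyTorch version:

```python
import torch
import torch.nn as nn

# Small illustrative encoder; enable_nested_tensor lets the fastpath skip padding.
layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=4, enable_nested_tensor=True)
encoder.eval()  # the fastpath is an inference-time optimization

tokens = torch.randn(2, 16, 256)                     # (batch, sequence, dim)
padding_mask = torch.zeros(2, 16, dtype=torch.bool)  # True marks padded positions
padding_mask[1, 10:] = True                          # second sequence is shorter

with torch.inference_mode():
    out = encoder(tokens, src_key_padding_mask=padding_mask)
print(out.shape)
```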

Efficiency on accelerators: Cross-request batching

Accelerator hardware is most efficient when it processes an aggregated group of requests in parallel, known as a batch. Optimal batching of requests increases the throughput of the service without introducing undue latency. Constructing batches would add complexity for our internal clients, and the ideal batch can change with new hardware or models.

To keep things simple for our internal users, MultiRay’s external API accepts a single request at a time. Internally, MultiRay uses cross-request batching logic to aggregate many concurrent requests across clients into a batch. This allows us to write the logic once and tune it to produce ideally sized batches for the model and hardware. The batching is completely hidden from the clients sending the requests, even when we make major performance changes, such as the larger batch sizes enabled by migrating to a new generation of GPU accelerator hardware.
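The sketch below is a highly simplified, hypothetical illustration of cross-request batching, not MultiRay's implementation: each client call submits a single item, and a background loop drains the shared queue into size- and latency-bounded batches.

```python
import asyncio

MAX_BATCH_SIZE = 32   # tuned per model and accelerator generation (illustrative)
MAX_WAIT_MS = 5       # cap on added queueing latency (illustrative)

async def run_model_on_batch(items):
    """Stand-in for a single forward pass over a batch on the accelerator."""
    return [f"embedding_for({item})" for item in items]

class CrossRequestBatcher:
    def __init__(self):
        self.queue: asyncio.Queue = asyncio.Queue()

    async def embed(self, item):
        """Single-request API seen by clients; batching stays hidden."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def batch_loop(self):
        while True:
            item, future = await self.queue.get()
            items, futures = [item], [future]
            deadline = asyncio.get_running_loop().time() + MAX_WAIT_MS / 1000
            # Keep filling the batch until it is full or the deadline passes.
            while len(items) < MAX_BATCH_SIZE:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    item, future = await asyncio.wait_for(self.queue.get(), timeout)
                except asyncio.TimeoutError:
                    break
                items.append(item)
                futures.append(future)
            for fut, emb in zip(futures, await run_model_on_batch(items)):
                fut.set_result(emb)

async def main():
    batcher = CrossRequestBatcher()
    loop_task = asyncio.create_task(batcher.batch_loop())
    results = await asyncio.gather(*(batcher.embed(f"post-{i}") for i in range(10)))
    print(results)
    loop_task.cancel()

asyncio.run(main())
```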

Cache: Trade-off compute and storage

MultiRay uses a cache to save on the cost of recomputation as much as possible. The cache is multilayered to minimize cost and latency, with each successive layer offering a higher hit rate at the cost of lower speed. The layers start with a fast but small per-host local cache in the RAM of every MultiRay server and end with a slower but much larger globally distributed cache in flash memory.
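A minimal sketch of that lookup order, with hypothetical stand-in cache classes rather than MultiRay's actual storage layers:

```python
class DictCache:
    """Toy stand-in for one cache layer (real layers are RAM- or flash-backed)."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def set(self, key, value):
        self._data[key] = value

def get_embedding(key, local_cache, distributed_cache, recompute_on_accelerator):
    value = local_cache.get(key)           # fast, small, per-host RAM cache
    if value is not None:
        return value
    value = distributed_cache.get(key)     # slower, much larger, flash-backed
    if value is not None:
        local_cache.set(key, value)        # back-fill the faster layer
        return value
    value = recompute_on_accelerator(key)  # most expensive path: run the model
    distributed_cache.set(key, value)
    local_cache.set(key, value)
    return value

local, distributed = DictCache(), DictCache()
print(get_embedding("post-123", local, distributed, lambda k: f"embedding_for({k})"))
```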

The MultiRay models are large, and they produce large embeddings (many kilobytes) to preserve universality. For text understanding, these embeddings are much larger than the inputs themselves! Serving an embedding out of cache takes less energy than recomputing it, but the cost is not zero. And because the available cache storage is finite, results cannot be cached for long.

MultiRay measures request patterns across clients to determine the best cache settings (size, time-to-live, update policies) for reducing the total cost of the service. For example, we use this measured data to simulate the energy required under various cache lifetime settings, trading off the cost of recomputing a request on accelerators against the cost of serving it from cache. This feedback loop allows us to improve the efficiency of MultiRay even as client behavior constantly changes.
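A toy version of that trade-off (all numbers below are illustrative assumptions, not Meta's measurements) picks the time-to-live that minimizes the combined recomputation and storage cost:

```python
RECOMPUTE_COST = 1.0          # relative cost of one accelerator forward pass
CACHE_COST_PER_HOUR = 0.002   # relative cost of holding one entry for an hour

# Measured hit rate for each candidate TTL (hours), e.g. from request-pattern logs.
measured_hit_rate = {1: 0.35, 6: 0.55, 24: 0.70, 72: 0.78}

def total_cost(ttl_hours, hit_rate):
    recompute = (1.0 - hit_rate) * RECOMPUTE_COST   # misses must be recomputed
    storage = ttl_hours * CACHE_COST_PER_HOUR       # longer TTL costs more storage
    return recompute + storage

best_ttl = min(measured_hit_rate, key=lambda t: total_cost(t, measured_hit_rate[t]))
print(best_ttl, total_cost(best_ttl, measured_hit_rate[best_ttl]))
```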

No free lunch: The challenges of a centralized service

A centralized service used across Meta comes with many challenges. Some of them, such as client management, quotas, and cost attribution, are considered solved problems for large-scale systems like databases, but they had to be adapted for the AI domain. Query size and cache hit rate both affect the energy required to process queries, so quotas are more complex. Additionally, sharing the expense of the higher-quality, more expensive MultiRay models only works if our models are widely used, which requires them to offer state-of-the-art quality across many use cases. This moving target means we have invested heavily in model refresh (versioning, upgrading clients to newer versions, and deprecating older ones) and in new model architectures and training flows that reduce research-to-production time and keep MultiRay users on the latest technology.

Learn more about MultiRay

If you’re curious about MultiRay, we encourage you to take a look at the research from Meta’s Foundational AI Research (FAIR) team that led to its development:

Unsupervised cross-lingual representation learning at scale — where researchers first demonstrated that multilingual modeling can be done without sacrificing per-language performance.

General purpose text embeddings from pre-trained language models for scalable inference — where researchers demonstrated a solution for NLP in which multiple tasks are performed on the same text using large-scale pre-trained models at a fraction of the compute cost.

Multiscale vision transformers and Masked autoencoders as spatiotemporal learners — foundational research pointing toward how MultiRay can be applied to video-related tasks in the future.

Acknowledgements

We would like to thank Abhinandan Krishnan, Anshul Verma, Daniel Ho, Davis Liang, Emily Shen, Evan Trippler, Hanchao Yu, Harley Boughton, Harsha Naidu K, Jafar Taghiyar, Michael Saha, Philippe Brunet, Rui Hou, Ruty Rinott, Shreya Goyal, Victor Dogaru, Akram Baharlouei, Charles Bai, Chenyang Yu, Jeff Wang, Manisha Jain, Marvin Wang, Maxim Grechkin, Michael Wu, Rao Bayyana, and Ves Stoyanov, who helped make this happen.

Written By

Nikhil Gupta

Software Engineer

Michael Gschwind

Software Engineer

Don Husa

Software Engineering Manager

Christopher Dewan

Software Engineer

Madian Khabsa

Manager, Research Science