May 18, 2023 · 8 min read
Meta has completed the second-phase buildout for our Research SuperCluster (RSC), making it one of the fastest AI supercomputers in the world.
At full strength, we achieve almost 5 exaflops of computing power. (One exaflop is a quintillion — that’s a billion billion — calculations per second.)
Just how massive is that? You’d have to perform one calculation every second for 31,688,765,000 years to match what a 1 exaflop computer system can do in just one second.
This level of performance is achieved through the use of 2,000 NVIDIA DGX A100 systems as RSC’s compute nodes, for a total of 16,000 NVIDIA A100 Tensor Core GPUs, connected via an NVIDIA Quantum InfiniBand 16 Tb/s fabric network.
As a highly reliable training cluster, RSC is enabling us to conduct research at an unprecedented scale and speed. This allows us to continuously run experiments and deliver faster outcomes.
Today, we are spotlighting a few of the many projects that run on RSC and sharing how RSC has enabled us to accelerate our research across a diverse set of research areas.
Training large AI models is traditionally a resource- and time-intensive task. By adding an unprecedented number of GPUs to the mix, we can significantly reduce the time it takes to train and fine-tune a model.
Three elements influence the rate at which a model trains: the number of tokens (the words, numbers, and phrases in the training dataset), the number of model parameters, and the number of GPUs used for training. In the animation below, observe how a higher number of parameters and tokens increases the time needed from hours to months. As the number of GPUs allocated increases, that training time can be drastically reduced. We’ve used this massive scale to our advantage over the past year to train a number of projects that are already making an impact.
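As a rough illustration of how these three factors interact, the widely used ~6·N·D FLOPs heuristic from the scaling-law literature (not a Meta-published formula) can sketch a training-time estimate. The utilization figure below is an illustrative assumption, not an RSC measurement; 312 TFLOP/s is the A100's BF16 tensor-core peak.

```python
def estimated_training_days(params, tokens, num_gpus,
                            peak_flops_per_gpu=312e12, utilization=0.4):
    """Rough training-time estimate using the ~6*N*D FLOPs heuristic.

    params             -- model parameter count N
    tokens             -- training-set token count D
    peak_flops_per_gpu -- 312 TFLOP/s is the A100 BF16 tensor-core peak
    utilization        -- assumed fraction of peak sustained (illustrative)
    """
    total_flops = 6 * params * tokens
    effective_flops_per_sec = num_gpus * peak_flops_per_gpu * utilization
    return total_flops / effective_flops_per_sec / 86_400  # seconds -> days

# A 65B-parameter model on 1.4 trillion tokens with 2,048 GPUs:
print(round(estimated_training_days(65e9, 1.4e12, 2048), 1))  # ~24.7 days
```

Under these assumptions the estimate lands in the same ballpark as the 21-day LLaMA 65B run discussed below, which is what makes the heuristic a useful planning tool.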
Large language models, when trained properly, have the potential to offer substantial benefits to billions of people. We used RSC to train LLaMA (Large Language Model Meta AI), a foundational 65-billion-parameter large language model that we shared as a gated release to the research community. Our goal was to provide access to a smaller, more performant model that researchers could study and fine-tune for specific tasks without needing significant hardware.
Foundation models train on large sets of unlabeled data. We trained LLaMA 65B and the smaller LLaMA 33B on 1.4 trillion tokens. Our smallest model, LLaMA 7B, trained on one trillion tokens. Once again, the ability to run at scale let us accelerate training and tuning iterations and release much faster than we otherwise would have. One important measure for these models is throughput: how quickly they process tokens during training. Our largest model, LLaMA 65B, was trained on 2,048 NVIDIA A100 GPUs, processing 380 tokens per second per GPU and completing its 1.4-trillion-token run in just 21 days.
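The reported figures are mutually consistent, which a quick back-of-envelope check confirms:

```python
# Sanity check: 380 tokens/s/GPU sustained on 2,048 GPUs for 21 days
# should land near the 1.4 trillion tokens reported for LLaMA 65B.
tokens_per_sec_per_gpu = 380
num_gpus = 2048
days = 21
total_tokens = tokens_per_sec_per_gpu * num_gpus * days * 86_400
print(f"{total_tokens / 1e12:.2f} trillion tokens")  # 1.41 trillion
```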
FAIR’s NLLB-200 machine translation model, which translates across 200 languages, was a groundbreaking release that was made possible by RSC.
We leveraged the capacity of RSC to decrease training times from one month to between seven and 10 days, resulting in better accuracy and quality of the model. The large number of GPUs and the network performance allowed us to run more iterations and improve our tuning faster before we publicly shared details of NLLB.
Research like NLLB is core to Meta’s goal of bringing the world closer together. In the future, NLLB will help us drive research advancements and contribute to making Meta content available to many people in their native languages.
AI has enabled incredible advancements in translation, but until now, much of that progress has focused on languages with a rich history of written texts. While these advancements are important, we also wanted to explore how AI could help create translations for a language that is primarily spoken and without a standardized writing system.
We used RSC to train the world’s first AI-powered translation system for a primarily oral language, enabling Hokkien speakers to hold conversations with English speakers in real time. While the language is widely spoken within the Chinese diaspora, it does not have a standard written form. Since traditional machine translation tools train on large amounts of text (Wikipedia, public internet crawls, Project Gutenberg book downloads, and other sources, often in English), we had to look beyond standard techniques. RSC enabled us to cut pretraining times in half, from an average of two months to just under one month.
Teaching AI to solve advanced mathematical problems is an important step toward building intelligent machines. Using HyperTree Proof Search (HTPS), we trained a neural theorem prover on a dataset of successful mathematical proofs, with the goal of creating a system that can solve International Math Olympiad (IMO) problems. Having capacity on RSC helped us accelerate our progress. We increased the training scale to 2,000 GPUs, which helped us finish the project significantly ahead of schedule. Our system can solve 10 IMO problems — 5x more than any previous AI system. In keeping with our approach to open science, we publicly released our model through the Lean Visual Studio Code (VSCode) plugin and shared additional details in a research paper.
The first stage of building RSC was experimental. As we continued to build out the supercomputer to its full capacity, we were also looking at the performance of early projects, how to best manage the allocation of GPUs, and lessons we could learn for future success.
Early adopter projects helped us codify the lessons learned and best practices as we moved closer to completing the build-out of RSC. We learned a great deal about how best to allocate capacity to our research teams, adopting a dynamic QoS model that has helped reduce resource contention for our 16K GPUs.
We use RSC’s scale to run many projects concurrently by selectively onboarding large-scale workloads alongside many smaller projects. This translates into faster iterations, faster tuning cycles, and faster time to completion. This reduction in time to completion, coupled with the vast number of GPUs available, means that RSC is positioned to accelerate all of Meta’s research efforts and pursue large-model efforts.
Working in partnership with our implementation partner, Penguin Computing, we improved our overall cluster management. By the time we completed the second phase of building RSC, availability consistently stayed above 95 percent. This was no small feat given that we added a 10K GPU cluster while concurrently running multiple research projects. We now have a repeatable, reliable template for building large GPU clusters.
As of 2023, increased GPU availability has allowed some of our major projects that run on thousands of GPUs to sustain training for weeks at a time. That stability is important to RSC’s success, especially when considering how much data our models ingest and train on every day.
The RSC environment is served by our secure and scalable storage solution, AirStore. We worked with our partners Penguin Computing and Pure Storage to deploy the underlying hardware for AirStore, which consists of 80 PB of cache and over a half exabyte of bulk storage, offering up to 16 TB/s of throughput.
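To put that throughput in perspective, a quick back-of-envelope calculation (idealized, ignoring contention and real access patterns) shows how fast the entire cache tier could in principle be streamed:

```python
# At the stated 16 TB/s aggregate throughput, streaming the entire
# 80 PB cache tier once would take (idealized, no contention):
cache_bytes = 80e15               # 80 PB
throughput_bytes_per_sec = 16e12  # 16 TB/s
seconds = cache_bytes / throughput_bytes_per_sec
print(f"{seconds:.0f} s (~{seconds / 3600:.1f} hours)")  # 5000 s (~1.4 hours)
```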
AirStore is designed to provide the performance, scale, and security necessary to facilitate the use of de-identified data from Meta’s production platforms. The AirStore architecture also provides scalable distribution of training data along with a fetching and preprocessing stage, which minimizes cross-regional traffic and the number of touches between training runs. AirStore presents data to the training machines via a local cache component, which minimizes data-operation latencies. AirStore ensures that training data is audited, encrypted, and logged at all times.
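AirStore’s internals are not public, but the local-cache pattern described above is a standard one. As a loose illustration only, the sketch below wraps a slow remote fetch with an on-disk cache so repeated reads are served locally; every name here (`fetch_remote`, the cache directory) is hypothetical and unrelated to the real AirStore API.

```python
import hashlib
import os
import tempfile

# Hypothetical local cache directory (stand-in for a node-local cache tier).
CACHE_DIR = os.path.join(tempfile.gettempdir(), "airstore_cache_demo")

def fetch_remote(key: str) -> bytes:
    """Stand-in for a cross-region read from bulk storage (hypothetical)."""
    return f"payload-for-{key}".encode()

def cached_read(key: str) -> bytes:
    """Serve from the local cache when possible, else fetch once and store."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, hashlib.sha256(key.encode()).hexdigest())
    if os.path.exists(path):        # cache hit: no cross-region traffic
        with open(path, "rb") as f:
            return f.read()
    data = fetch_remote(key)        # cache miss: fetch once, then reuse
    with open(path, "wb") as f:
        f.write(data)
    return data

first = cached_read("shard-001")    # miss: populates the cache
second = cached_read("shard-001")   # hit: served from local disk
print(first == second)              # True
```

The design point is the same one the paragraph makes: after the first touch, subsequent training runs read locally rather than crossing regions again.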
With RSC, one of the many important goals we have is to leverage real-world data from our platform as we develop new models. For example, identifying harmful content requires a large amount of real-world data based on the kinds of content being posted. Otherwise, such identification becomes extremely difficult.
Before data is imported into the RSC, it must go through a privacy review process to confirm that appropriate privacy safeguards have been put in place to protect the data. The data is then encrypted before it can be used to train AI models, and both the data and the decryption keys are deleted regularly to ensure older data is not still accessible. Since the data is decrypted at only one endpoint, in memory, it is safeguarded even in the unlikely event of a physical breach of the facility.
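Meta has not published the cipher or key-management details used on RSC. The toy sketch below (deliberately not production cryptography) only illustrates the property the paragraph describes: ciphertext at rest is useless once the key is deleted, and plaintext exists only transiently in memory.

```python
import hashlib
import secrets

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy SHA-256-based stream cipher -- illustration only, NOT secure crypto."""
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, out))

key = secrets.token_bytes(32)  # per-dataset key, stored separately from the data
ciphertext = keystream_xor(key, b"de-identified training example")

# Decryption happens only in memory, at the single training endpoint:
plaintext = keystream_xor(key, ciphertext)
assert plaintext == b"de-identified training example"

# Deleting the key (on a rotation schedule) leaves the stored ciphertext
# unrecoverable, so aged-out data can no longer be read.
del key
```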
As we continue to conduct research, the RSC will be an important environment to support our efforts in new and emerging areas, like generative AI. With the RSC, we have an expanded capability to train large language models as we look to take advantage of new opportunities in this area across our family of apps and consumer projects. Our progress with LLaMA has already laid the foundation for what we can achieve with the RSC’s capacity and speed.
We are also entering the most complex paradigm shift in computing that we have ever undertaken: building the metaverse. To do this, we need an AI supercomputer that can run at capacity with the largest, fastest, and greatest number of modalities. Imagine everything that goes into playing a game in virtual reality, from returning a ping pong ball to an opponent to holding a staff meeting in Horizon Workrooms.
There’s already a lot that needs to happen to ensure a smooth experience, but in the metaverse, that work will be even more complex, and it will require additional computing power. We will get there, but the work will be complicated. That’s why we are focusing on how RSC will help us as we continue to tackle these challenges. This includes helping people communicate in a rich new environment — no matter what language they speak — and also enabling AI to understand nonverbal cues, such as rendering a smile accurately in the metaverse. All of these are a part of the bigger picture work we are doing at Meta, and RSC will be an integral part of helping us get there.