May 18, 2023 · 5 min read
Meta’s AI compute needs will grow dramatically over the next decade as we break new ground in AI research, ship more cutting-edge AI applications and experiences for our family of apps, and build our long-term vision of the metaverse.
We are now executing on an ambitious plan to build the next generation of Meta’s infrastructure backbone – specifically built for AI – and in this blog post we’re sharing some details on our recent progress. The projects we’re announcing here touch many of the layers of our hardware and software stack as well as the customized network that connects these technologies from top to bottom. They include our first custom chip for running AI models, a new AI-optimized data center design, and phase 2 of our 16,000 GPU supercomputer for AI research.
These transformational efforts — and additional projects still underway — will enable us to develop much larger, more sophisticated AI models and then deploy them efficiently at scale. AI is already at the core of Meta’s products, enabling better personalization; safer, fairer products; and richer experiences, while also helping businesses reach the audiences they care about most. We are even reimagining how we code — deploying CodeCompose, a generative AI–based coding assistant developed in-house at Meta, as a key tool to make our developers more productive throughout the software development life cycle. By rethinking how we innovate across our infrastructure, we’re creating a scalable foundation to power emerging opportunities in the near term in areas like generative AI, and in the longer term as we bring new AI-powered experiences to the metaverse.
For more on the AI investments shared in this post, check out the Meta AI Infra @Scale page.
Ever since we broke ground on our first data center back in 2010, Meta has built a global infrastructure that today serves as the engine for the more than 3 billion people who use Meta’s family of apps each day. AI has been an important part of these systems for many years — from our Big Sur hardware in 2015 to our development of PyTorch to our initial deployment last year of Meta’s supercomputer for AI research. We’ve now advanced our infrastructure in exciting new ways:
MTIA is Meta’s first in-house, custom accelerator chip family targeting inference workloads. MTIA provides greater compute power and efficiency than CPUs, and it is customized for our internal workloads. By deploying both MTIA chips and GPUs, we’ll deliver higher performance, lower latency, and greater efficiency for each workload.
Meta’s next-generation data center design will support our current products while enabling future generations of AI hardware for both training and inference. This new data center will be an AI-optimized design, supporting liquid-cooled AI hardware and a high-performance AI network connecting thousands of AI chips for data center–scale AI training clusters. It will also be faster and more cost-effective to build, and it will complement other new hardware, such as Meta’s first in-house-developed ASIC solution, MSVP, which is designed to power the constantly growing video workloads at Meta.
Meta’s Research SuperCluster (RSC), which we believe is one of the fastest AI supercomputers in the world, was built to train the next generation of large AI models to power new augmented reality tools, content understanding systems, real-time translation technology, and more. It features 16,000 GPUs, all accessible across the three-level Clos network fabric that provides full bandwidth to each of the 2,000 training systems. Over the past year, RSC has been powering research projects like LLaMA, the large language model Meta built and shared earlier this year.
These AI-focused efforts enable us to take advantage of exciting new software advances like PyTorch 2.0. The latest version of this open source AI framework, which was created by Meta in 2016 in partnership with the AI community, offers the same powerful, flexible, easy-to-use workflow. But it fundamentally changes and accelerates how the framework operates at the compiler level under the hood. With 2.0, PyTorch now provides faster performance and support for new features, like accelerated transformers and dynamic shapes.
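As a rough sketch of what the 2.0 features mentioned above look like in user code (assuming PyTorch ≥ 2.0; the model and tensor shapes here are hypothetical, and the `backend="eager"` option is a debugging backend used only so the sketch runs without a compiler toolchain):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A small hypothetical model compiled with the torch.compile entry point
# introduced in PyTorch 2.0. dynamic=True opts in to dynamic-shape support;
# backend="eager" skips code generation so this sketch runs anywhere.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
compiled = torch.compile(model, dynamic=True, backend="eager")
y = compiled(torch.randn(5, 64))  # shape (5, 10)

# Accelerated transformers: PyTorch 2.0 exposes a fused scaled-dot-product
# attention kernel directly (layout: batch, heads, seq_len, head_dim).
q = k = v = torch.randn(2, 4, 8, 16)
attn = F.scaled_dot_product_attention(q, k, v)  # shape (2, 4, 8, 16)
```

The same `torch.compile` call accepts other backends (the default is the Inductor compiler), so models written once can be recompiled against different hardware targets.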
Custom-designing much of our infrastructure enables us to optimize an end-to-end experience, from the physical layer to the software layer to the actual user experience. We design, build, and operate everything from the data centers to the server hardware to the mechanical systems that keep everything running. Because we control the stack from top to bottom, we’re able to customize it for our specific needs. For example, we can easily colocate GPUs, CPUs, network, and storage if it will better support our workloads. If that in turn means we need different power or cooling solutions, we can rethink those designs, as well, as part of one cohesive system.
This will only be more important in the years ahead. Over the next decade, we’ll see increased specialization and customization in chip design, purpose-built and workload-specific AI infrastructure, new systems and tooling for deployment at scale, and improved efficiency in product and design support. All of this will deliver increasingly sophisticated models built on the latest research — and products that give people around the world access to this emergent technology.
Meta has always focused on delivering long-term value and impact to guide our infrastructure vision. We believe our track record of building world-class infrastructure positions Meta to continue to lead in AI over the next decade and beyond, and the work we’ve discussed here will have a significant impact on our family of apps today and metaverse initiatives tomorrow.
We look forward to sharing more updates on our work to harness AI’s immense potential in new ways to benefit billions of people.
MTIA v1: Meta’s first-generation AI inference accelerator
In 2020, we initiated the Meta Training and Inference Accelerator (MTIA) family of chips to support our evolving AI workloads, starting with an inference accelerator ASIC for deep learning recommendation models (DLRMs).
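The workload class MTIA v1 targets can be illustrated with a toy DLRM-style model (a hypothetical sketch in PyTorch, not Meta’s production architecture): sparse categorical features go through embedding-table lookups, dense features through a bottom MLP, and the concatenated results feed a top MLP that emits a click probability. The heavy mix of memory-bound embedding lookups and compute-bound MLPs is what makes this workload a candidate for a custom inference ASIC.

```python
import torch
import torch.nn as nn

class TinyDLRM(nn.Module):
    """Toy DLRM-style model: embedding lookups + bottom/top MLPs.
    All sizes here are illustrative, not production values."""

    def __init__(self, num_embeddings=1000, dim=16, num_sparse=3, num_dense=4):
        super().__init__()
        # One embedding table per sparse (categorical) feature.
        self.tables = nn.ModuleList(
            nn.Embedding(num_embeddings, dim) for _ in range(num_sparse)
        )
        # Bottom MLP projects dense features into the embedding dimension.
        self.bottom = nn.Sequential(nn.Linear(num_dense, dim), nn.ReLU())
        # Top MLP consumes the concatenated sparse + dense representations.
        self.top = nn.Sequential(
            nn.Linear(dim * (num_sparse + 1), 32), nn.ReLU(), nn.Linear(32, 1)
        )

    def forward(self, dense, sparse):
        parts = [table(sparse[:, i]) for i, table in enumerate(self.tables)]
        parts.append(self.bottom(dense))
        return torch.sigmoid(self.top(torch.cat(parts, dim=1)))

model = TinyDLRM()
dense = torch.randn(8, 4)                     # batch of dense features
sparse = torch.randint(0, 1000, (8, 3))       # batch of categorical IDs
probs = model(dense, sparse)                  # per-example probabilities
```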
Pursuing groundbreaking scale and accelerating research using Meta’s Research SuperCluster
Meta’s RSC is one of the fastest AI supercomputers in the world. Today, we’re sharing how we are leveraging RSC’s power to accelerate AI research at scale and deliver impactful results in a fraction of the time.
MSVP: Meta’s first ASIC for video transcoding
The Meta Scalable Video Processor (MSVP) will support video on demand and live streaming, as well as generative AI and AR/VR content.