
Faster, more flexible inference on GPUs using AITemplate, a revolutionary new inference engine

Oct. 3, 2022

GPUs play an important role in delivering the compute needed to deploy AI models, especially large-scale pretrained models in computer vision, natural language processing, and multimodal learning. Currently, AI practitioners have very limited flexibility when choosing a high-performance GPU inference solution because these solutions are concentrated in platform-specific, closed, black-box runtimes. A machine learning system designed for one technology provider’s GPU must be completely reimplemented in order to work on a different provider’s hardware. This lack of flexibility also makes it difficult to iterate on and maintain the code that makes up these solutions, because of the hardware dependencies in their complex runtime environments.

Moreover, AI production pipelines often require fast development. Developers are eager to try novel modeling techniques because the field is advancing rapidly. Although proprietary software toolkits such as TensorRT offer some customization options, these are often not enough to satisfy this need. Furthermore, a closed, proprietary solution can make it harder to debug the code quickly, reducing development agility.

To address these industry challenges, Meta AI has developed and is open-sourcing AITemplate (AIT), a unified inference system with separate acceleration back ends for both AMD and NVIDIA GPU hardware. It delivers close to hardware-native Tensor Core (NVIDIA GPU) and Matrix Core (AMD GPU) performance on a variety of widely used AI models such as convolutional neural networks, transformers, and diffusers. With AIT, it is now possible to run performant inference on hardware from both GPU providers. We’ve used AIT to achieve performance improvements of up to 12x on NVIDIA GPUs and up to 4x on AMD GPUs compared with PyTorch eager mode.

AITemplate is a Python framework that transforms AI models into high-performance C++ GPU template code for accelerating inference. Our system is designed for speed and simplicity. There are two layers in AITemplate — a front-end layer, where we perform various graph transformations to optimize the graph, and a back-end layer, where we generate C++ kernel templates for the GPU target. In addition, AIT maintains a minimal dependency on external libraries. For example, the generated runtime library for inference is self-contained and only requires CUDA/ROCm runtime environments. (CUDA, NVIDIA’s Compute Unified Device Architecture, allows AI software to run efficiently on NVIDIA GPUs. ROCm is an open source software platform that does the same for AMD’s GPUs.)

Our project offers many performance innovations, including advanced kernel fusion, an optimization method that merges multiple kernels into a single kernel to run them more efficiently, and advanced optimizations for transformer blocks. These optimizations deliver state-of-the-art performance by significantly increasing utilization of NVIDIA's Tensor Cores and AMD's Matrix Cores.

AITemplate is currently enabled on NVIDIA’s A100 and AMD’s MI200 GPU systems, both of which are widely used today in the data centers of technology companies, research labs, and cloud computing service providers.

The benchmark results shown below compare the performance of PyTorch eager mode and AITemplate on NVIDIA A100 GPUs for several mainstream models.

AIT and PyTorch eager* ResNet-50 and BERT-Base with sequence length 384 benchmark on NVIDIA A100-40GB.

As shown in the benchmark below, AITemplate also delivers significant performance gains on the AMD MI250 GPU, including for the ResNet and transformer models that power advanced vision and language systems. In the MI250’s two-GCD configuration, each GCD (graphics compute die) processes half of the data.

PyTorch eager and AIT ResNet-50 and BERT-Base with sequence length 384 benchmark on AMD MI250. The MI250 runs in data-parallel mode, where each GCD (GPU core) processes half of the data. For batch size 1, the batch is processed on a single GCD while the other GCD is idle.
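
The data-parallel split described in this caption can be sketched in a few lines of Python. This only illustrates the batching arithmetic: split_batch and its parameters are hypothetical names, not part of AITemplate or ROCm, and a real deployment would run one inference stream per GCD.

```python
def split_batch(batch, num_devices=2):
    """Split a batch into contiguous shards, one per device (e.g. one per GCD).

    Hypothetical helper for illustration only. An empty shard means the
    corresponding device stays idle.
    """
    shard_size = -(-len(batch) // num_devices)  # ceiling division
    return [
        list(batch[i * shard_size:(i + 1) * shard_size])
        for i in range(num_devices)
    ]
```

With a batch of eight inputs, each GCD receives four. With a batch of one, the second shard is empty, which corresponds to the idle GCD mentioned in the caption.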

The unified GPU back-end support gives deep learning developers more hardware vendor choices with minimal migration costs.

Deploying AITemplate is straightforward. The AI model is compiled into a self-contained binary without dependencies. This binary can run in any environment with the same hardware and newer CUDA 11 / ROCm 5 versions, which results in excellent backward compatibility. This is important in production environments, where stability and backward compatibility are crucial. AITemplate also provides widely used models out of the box (e.g., VisionTransformer, BERT, Stable Diffusion, ResNet, and MaskRCNN). This simplifies the deployment process and allows practitioners to deploy PyTorch pretrained models easily.

AITemplate optimizations

AITemplate has one of the most advanced kernel fusion systems in the industry, thanks to its support of three innovative optimizations: vertical, horizontal, and memory fusions. Vertical fusions fuse chains of operations together. Horizontal fusions fuse parallel operations with no dependency together into a single grouped op. Memory fusions fuse memory movement ops and computation-intensive operations together. Vertical, horizontal, and memory fusions can also be combined.

AITemplate can combine three fusions together to accelerate inference.
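
To make vertical fusion concrete, here is a minimal CPU-side sketch, assuming a chain of element-wise ops applied to a flat list of floats. It is not AITemplate code: the op table and function names are hypothetical, and the number of passes over the data stands in for kernel launches and round trips to GPU memory.

```python
import math

# Hypothetical element-wise ops used only for this illustration.
ELEMENTWISE = {
    "add_one": lambda x: x + 1.0,
    "relu": lambda x: max(x, 0.0),
    "tanh": math.tanh,
}


def run_unfused(ops, data):
    """One 'kernel' per op: every op reads and writes a full intermediate buffer."""
    passes = 0
    for name in ops:
        data = [ELEMENTWISE[name](x) for x in data]  # new intermediate buffer
        passes += 1
    return data, passes


def run_fused(ops, data):
    """Vertical fusion: the whole chain is applied per element in a single pass."""
    fns = [ELEMENTWISE[name] for name in ops]
    out = []
    for x in data:
        for fn in fns:
            x = fn(x)  # intermediate values stay local, never written to a buffer
        out.append(x)
    return out, 1
```

Both functions return the same values, but the fused version touches the data once instead of once per op. On a GPU, that is the difference between several kernel launches with intermediate tensors in memory and one launch that keeps intermediates in registers.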

For horizontal fusions, AITemplate currently supports grouped GEMM operations, grouped GEMM + activation ops, and grouped layernorm/swish layernorm operations. AITemplate also supports several vertical fusions beyond standard element-wise operations. These include:

GEMM and element-wise fusions through CUTLASS and Composable Kernel epilogue fusion
GEMM and permute fusions for transformer multihead attention blocks
Fusion of memory operations, such as split, slice, and concatenate, with other ops to reduce memory bandwidth via Tensor Accessors
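
The last item, fusing a memory op such as concatenate into the ops that produce its inputs, can be sketched as follows. This is a simplified illustration under stated assumptions, not how Tensor Accessors are implemented: the function names are hypothetical, and a plain Python list with a running offset stands in for an output tensor addressed through strides and offsets.

```python
def concat_unfused(producers, inputs):
    """Each producer writes its own temporary buffer, then concat copies them."""
    temps = [[fn(x) for x in xs] for fn, xs in zip(producers, inputs)]
    out = []
    copied = 0
    for temp in temps:
        out.extend(temp)  # extra memory traffic: copy every element again
        copied += len(temp)
    return out, copied


def concat_fused(producers, inputs):
    """Each producer writes directly into its slice of the final output."""
    out = [0.0] * sum(len(xs) for xs in inputs)
    offset = 0
    for fn, xs in zip(producers, inputs):
        for i, x in enumerate(xs):
            out[offset + i] = fn(x)  # result lands at its final location
        offset += len(xs)
    return out, 0  # no separate copy step
```

The fused variant produces the same output without the standalone copy, which is the bandwidth saving the list item refers to.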

For standard transformer multihead attention blocks, AITemplate currently relies on Flash Attention on NVIDIA GPUs and on a generalized back-to-back GEMM/softmax/GEMM fusion in Composable Kernel on AMD GPUs. Both implementations completely remove the data traffic between the compute units and HBM (high-bandwidth memory) for the intermediate result. Composable Kernel can fuse a wide range of bottleneck structures in neural networks, not only attention blocks. Many problems that were bandwidth-bound become compute-bound, so the system can use GPU compute power much more efficiently. This optimization is more effective for transformer models with long sequences, as shown below.

Our approach not only extends previous systems by generating templates within a compiler and applying state-of-the-art multidimensional fusion (horizontal, vertical, and memory fusion) but also introduces a unified solution for both NVIDIA and AMD GPUs.
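
The attention fusion described earlier avoids writing the intermediate score matrix to HBM. One way to see why that is possible is the online softmax used by fused attention kernels: scores are processed block by block while a running maximum, a running normalizer, and a running output are updated. The sketch below shows this for a single query vector in plain Python. It is an illustration of the idea only, not the Flash Attention or Composable Kernel implementation, and all function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_reference(q, keys, values):
    """Unfused: materialize all scores, then softmax, then the weighted sum."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attention_streaming(q, keys, values, block=2):
    """Fused-style: only one block of scores exists at any time."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        scores = [dot(q, k) for k in keys[start:start + block]]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum
        # (exp(-inf) is 0.0, so the first block starts from zero).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, values[start:start + block]):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]
```

The streaming version matches the reference up to floating-point rounding while keeping only a block of scores live. The longer the sequence, the larger the score matrix that never has to be written out, which is consistent with the optimization being more effective for long sequences.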

Developing AITemplate

AITemplate has two layers of template systems: the first is the Python Jinja2 template, and the second is the GPU Tensor Core/Matrix Core C++ template (CUTLASS for NVIDIA GPUs and Composable Kernel for AMD GPUs). AITemplate first runs profiling in Python to find the best kernel configuration, and then renders the Jinja2 template into C++ code.
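
The profile-then-render flow can be sketched roughly as follows. To keep the example dependency-free, Python's standard string.Template stands in for Jinja2, and the template text, the tile parameters, the function name, and the cost function are all hypothetical. A real profiler would time candidate kernels on the GPU rather than use a made-up cost.

```python
from string import Template

# Hypothetical C++ skeleton; not real CUTLASS or Composable Kernel code.
KERNEL_TEMPLATE = Template("""\
// Generated for illustration only.
constexpr int kTileM = $tile_m;
constexpr int kTileN = $tile_n;
void ${func_name}(const float* a, const float* b, float* c);
""")


def pick_best_config(candidates, benchmark):
    """Profiling step: return the candidate with the lowest measured cost."""
    return min(candidates, key=benchmark)


def fake_benchmark(config):
    """Deterministic stand-in for timing a kernel on real hardware."""
    return abs(config["tile_m"] - 128) + abs(config["tile_n"] - 64)


def render_kernel(config, func_name):
    """Rendering step: substitute the chosen configuration into the template."""
    return KERNEL_TEMPLATE.substitute(func_name=func_name, **config)
```

Calling pick_best_config on a list of tile-size dictionaries and passing the winner to render_kernel yields C++ source text with the selected constants baked in, which would then be handed to the GPU compiler.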

After the model’s source code is generated, the GPU C++ compiler (NVIDIA NVCC or AMD HIPCC) compiles the source code into the final binary code for the model. Because AITemplate’s front-end design is similar to PyTorch’s, users can easily convert their models to AITemplate from many different frameworks, including PyTorch.

Greener computing

Our techniques expand the availability of AI platforms and can help reduce carbon emissions to address environmental concerns. Studies show that GPU usage can be tied to carbon emissions. AITemplate reduces GPU execution time, which will also reduce emissions. Since AI models are deployed in the core systems of technology companies around the world, greater efficiency can have a significant impact. The system also makes running inference of trained AI models more accessible, by allowing more platform choices for AI inference workloads.

Extending AITemplate to new hardware and adding more functionality

AITemplate offers state-of-the-art performance for current and next-gen NVIDIA and AMD GPUs with less system complexity. However, we are only at the beginning of our journey to build a high-performance AI inference engine. We are actively working on enhancing AITemplate with more optimizations and full dynamic shape support. We also plan to extend AITemplate to additional hardware systems, such as Apple M-series GPUs, as well as CPUs from other technology providers. Beyond this, we are working on the automatic lowering of PyTorch models to provide an additional turnkey inference solution for PyTorch. We are also open to exploring integrations with other frameworks, such as ONNX and OpenXLA. We hope to build a greener and more efficient AI inference ecosystem with better performance, higher flexibility, and more back-end choices.

Get the code:

https://github.com/facebookincubator/AITemplate

This work is being undertaken by a wide-ranging team at Meta that includes Bing Xu, Ying Zhang, Hao Lu, Yang Chen, Terry Chen, Mike Iovine, Mu-Chu Lee, Scott Wolchok, Oleg Khabinov, Shirong Wu, Huaming Li, Hui Guo, Zhijing Li, Max Podkorytov, Janet Yang, Yinghai Lu, Lu Fang, Andrew Tulloch, and Ajit Mathews.

* This diagram does not include the Better Transformer, introduced in PyTorch 1.12.

**Code and instructions to reproduce these results can be found in the repo’s examples folder.