Optimizing infrastructure for neural recommendation at scale

2/13/2020

What the research is:

We are sharing an in-depth characterization and analysis of the infrastructure used to deliver personalized results from deep neural network-based (DNN) recommendation models at scale. Although DNNs are widely used to generate search results, provide content suggestions, and power other common internet services, relatively little research attention has been devoted to optimizing the system infrastructure that serves such recommendations at scale. In addition to sharing insights about how this important class of neural recommendation models performs at production scale, we’ve also released the open source workloads and related performance metrics we used, to help other researchers and engineers evaluate their own DNNs.

Notable findings from this analysis include the following:

  • System heterogeneity leads to wide variations in inference latency across three generations of Intel servers.

  • Batching and colocation of recommendation inference can drastically improve latency-bounded throughput.

  • Heterogeneity in recommendation model architectures necessitates different system optimization strategies.

How it works:

To analyze the performance of production-scale recommendation models, we first identified quantitative metrics for evaluating recommendation workloads. We then designed a set of synthetic recommendation models to characterize inference performance on a variety of server-class Intel CPU systems. Our results highlight challenges unique to improving the efficiency of recommendation DNNs, compared with the techniques used to optimize traditional convolutional and recurrent neural network architectures.
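
To make this concrete, here is a minimal sketch of the kinds of knobs such a synthetic recommendation model exposes. The class name, field names, and default values are illustrative assumptions for this post, not the exact parameters used in the paper.

```python
from dataclasses import dataclass

@dataclass
class SyntheticRecModelConfig:
    """Hypothetical knobs for a synthetic recommendation model (illustrative only)."""
    num_tables: int = 8                  # number of embedding tables
    rows_per_table: int = 1_000_000      # categorical vocabulary size per table
    embedding_dim: int = 32              # width of each embedding vector
    lookups_per_table: int = 80          # sparse IDs gathered per table per sample
    bottom_mlp: tuple = (512, 256, 32)   # layers applied to dense features
    top_mlp: tuple = (128, 64, 1)        # layers applied after feature interaction
    batch_size: int = 256                # inference batch size swept in experiments

cfg = SyntheticRecModelConfig()
# Rough embedding storage footprint at fp32 (4 bytes per element):
footprint_gb = cfg.num_tables * cfg.rows_per_table * cfg.embedding_dim * 4 / 1e9
print(f"Embedding storage: {footprint_gb:.2f} GB")  # ~1 GB for these defaults
```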

For example, we found that three generations of Intel server architectures commonly used in data centers (Haswell, Broadwell, and Skylake) handle inference latency differently when serving production-scale recommendation models. Skylake systems accelerate compute-intensive recommendation models more readily, and their exclusive cache hierarchy is less susceptible to latency degradation when multiple models are colocated on the same system. Given the improvement in throughput when colocating models, recognizing these characteristics can help data centers schedule recommendation inference queries more effectively and improve infrastructure efficiency.
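
As a rough illustration of the batching effect noted above, the sketch below measures latency-bounded throughput by sweeping batch sizes and keeping only those whose tail latency stays within a service-level target. The function, the p99 statistic, and the 100 ms default target are assumptions chosen for illustration, not values taken from the paper.

```python
import time
import numpy as np

def latency_bounded_throughput(model, make_batch, batch_sizes,
                               sla_ms=100.0, iters=50):
    """Return the best items/sec achievable while p99 latency meets the SLA."""
    best = 0.0
    for bs in batch_sizes:
        batch = make_batch(bs)
        latencies_ms = []
        for _ in range(iters):
            t0 = time.perf_counter()
            model(batch)                                  # one inference query
            latencies_ms.append((time.perf_counter() - t0) * 1e3)
        if np.percentile(latencies_ms, 99) <= sla_ms:     # meets tail-latency SLA
            best = max(best, bs / (np.mean(latencies_ms) / 1e3))
    return best

# Stand-in model: 0.5 ms of work per item. A real model would also amortize
# fixed per-query overheads across the batch, strengthening the effect.
dummy_model = lambda batch: time.sleep(0.0005 * len(batch))
print(latency_bounded_throughput(dummy_model, lambda bs: [0] * bs, [1, 4, 16, 64]))
```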

[Figure] The execution flow of deep learning recommendation inference. Inputs to the model (a batch of size N) are a collection of continuous (dense) and categorical (sparse) features. Sparse features, unique to recommendation models, are transformed into a dense representation using embedding tables (shown in blue). The number and size of the embedding tables, the number of sparse feature (ID) lookups per table, and the depth and width of the Bottom-FC and Top-FC layers all vary with the use case.
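
To map that flow onto code, here is a minimal PyTorch sketch of the same structure. The table counts, dimensions, and layer shapes are illustrative, and the feature interaction is plain concatenation for brevity (the open source DLRM also supports dot-product interaction):

```python
import torch
import torch.nn as nn

class TinyRecModel(nn.Module):
    """Toy DNN recommendation model: embedding tables + Bottom-FC + Top-FC."""
    def __init__(self, num_tables=4, rows=10_000, dim=16, dense_in=13):
        super().__init__()
        # Embedding tables turn sparse (categorical) IDs into dense vectors;
        # mode="sum" pools the multiple lookups made for each sample.
        self.tables = nn.ModuleList(
            nn.EmbeddingBag(rows, dim, mode="sum") for _ in range(num_tables))
        self.bottom_fc = nn.Sequential(nn.Linear(dense_in, dim), nn.ReLU())
        self.top_fc = nn.Sequential(nn.Linear(dim * (num_tables + 1), 1),
                                    nn.Sigmoid())

    def forward(self, dense_x, sparse_ids, offsets):
        pooled = [t(sparse_ids[i], offsets[i]) for i, t in enumerate(self.tables)]
        bottom = self.bottom_fc(dense_x)                # dense features -> Bottom-FC
        interact = torch.cat([bottom] + pooled, dim=1)  # feature interaction
        return self.top_fc(interact)                    # Top-FC -> click probability

model = TinyRecModel()
N = 8                                                   # batch size
dense_x = torch.randn(N, 13)
# Two sparse-ID lookups per sample per table (multi-hot categorical features).
sparse_ids = [torch.randint(0, 10_000, (N * 2,)) for _ in range(4)]
offsets = [torch.arange(0, N * 2, 2) for _ in range(4)]
scores = model(dense_x, sparse_ids, offsets)            # shape: (N, 1)
```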

More generally, we showed that DNN-based recommendation systems differ from traditional neural networks in several important ways:

  • High-quality personalized recommendation requires much larger storage capacity than traditional CNN or RNN workloads.

  • At-scale recommendation inference produces irregular memory access patterns.

  • The diversity of recommendation use cases in production can produce an equally diverse set of operator-level performance bottlenecks.

These resource characteristics stem in part from the prevalence of both sparse and dense features. When ranking videos, for example, each individual user provides sparse input, interacting with only a handful of the thousands or even millions of videos available on a given platform, as the sketch below illustrates. Engineers need to consider this wide range of performance and resource requirements when accelerating DNN-based recommendation models, including when designing and optimizing recommendation inference hardware. Additional details of the system-level analysis and architectural insights are available in the paper linked below.
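
Here is a minimal sketch, with illustrative sizes, of why such sparse lookups dominate the storage and memory-access profile: each query gathers a handful of effectively random rows from very large embedding tables, so accesses have little spatial locality.

```python
import numpy as np

rows, dim = 1_000_000, 64                        # one table: ~0.26 GB at fp32;
table = np.zeros((rows, dim), dtype=np.float32)  # production tables can be far larger

# One sample's sparse feature: 80 IDs drawn from a huge vocabulary.
ids = np.random.randint(0, rows, size=80)
pooled = table[ids].sum(axis=0)                  # irregular gather, then pooling
# Dense features, by contrast, touch one small contiguous vector per sample.
```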

Why it matters:

Improving infrastructure efficiency for at-scale recommendation inference will hopefully contribute to faster and more accurate personalized recommendations for videos, products, and other ranked results. The insights from this analysis can be used to motivate broader system and architecture optimizations for at-scale recommendation.

This work builds on Facebook’s previous release of an advanced deep learning recommendation model (DLRM), which enables algorithmic experimentation and benchmarking for recommendation systems. We hope that sharing our results and open source synthetic models will shed further light on optimization opportunities for next-generation AI systems and help accelerate innovation across the AI community in the design and modeling of neural recommendation systems.

Read the full paper:

The Architectural Implications of Facebook’s DNN-based Personalized Recommendation

Written By

Carole-Jean Wu

Research Scientist

Udit Gupta

Research Intern