Accelerating distributed training with Stochastic Gradient Push

June 06, 2019

What the research is:

A novel variant of a distributed optimization method called Stochastic Gradient Push (SGP) for leveraging a large cluster of GPUs to train deep neural networks (DNNs) on large-scale datasets. The method is resilient to issues that commonly arise when training on large clusters, such as some nodes running slower than others or some communication links being unreliable.

In communication-bound settings, our method runs significantly faster than parallel Stochastic Gradient Descent (SGD), which uses AllReduce for synchronization. In these settings, it also trains better models in less time than SGD. For example, a variant of SGP trains a model whose final top-1 validation accuracy is one percent higher than that of standard parallel SGD, in half the time. For researchers who want to reproduce or build on this work, we’ve made our code publicly available here.

How it works:

We propose and analyze a variant of SGP, called Overlap SGP, which overlaps communication and computation to hide communication overhead. SGP is an algorithm that blends parallel SGD and the push-sum operation for approximate distributed averaging. Distributed data-parallel methods for training DNNs aim to speed up training by using parallel computing resources to process multiple data points concurrently. This involves some communication overhead to synchronize models across all of the compute nodes.

Standard parallel SGD uses the AllReduce operation for this synchronization. AllReduce is a blocking operation: all nodes must wait until it completes before they proceed to the next step, so one slow node or one slow communication link delays the entire system. In contrast, push-sum is a gossip-based message-passing algorithm that can be run in a nonblocking, asynchronous manner. In gossip protocols, each node exchanges messages with only a small subset of the other nodes in the system, and information gradually diffuses across the network. With push-sum, nodes do not need to wait for any operation to complete before they proceed to the next step in the algorithm.
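To make the push-sum idea concrete, here is a minimal single-process sketch of push-sum averaging over a directed graph. It is an illustration under assumptions, not our released implementation: the function name `push_sum`, the example graph `OUT_NEIGHBORS`, and the choice to split each node's mass uniformly between itself and its out-neighbors are all hypothetical, and the rounds run in lockstep here for simplicity.

```python
def push_sum(values, out_neighbors, num_rounds):
    """Approximate the average of `values` with push-sum gossip.

    Each node keeps a numerator x[i] and a weight w[i]. In every round it
    splits both uniformly between itself and its out-neighbors (push only).
    The ratio x[i] / w[i] is the node's de-biased estimate of the average.
    """
    n = len(values)
    x = [float(v) for v in values]  # push-sum numerators
    w = [1.0] * n                   # push-sum weights
    for _ in range(num_rounds):
        new_x = [0.0] * n
        new_w = [0.0] * n
        for i in range(n):
            recipients = [i] + list(out_neighbors[i])
            share = 1.0 / len(recipients)
            for j in recipients:
                new_x[j] += share * x[i]
                new_w[j] += share * w[i]
        x, w = new_x, new_w
    return [xi / wi for xi, wi in zip(x, w)]


# Hypothetical asymmetric topology: node 0 pushes to two peers, the rest to one.
OUT_NEIGHBORS = {0: [1, 2], 1: [2], 2: [3], 3: [0]}

estimates = push_sum([1.0, 2.0, 3.0, 10.0], OUT_NEIGHBORS, num_rounds=100)
print(estimates)  # every entry is approximately 4.0, the mean of the inputs
```

The weights matter because the graph is asymmetric: nodes that receive from more peers accumulate more mass, and dividing by the weight removes that bias.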

SGP is obtained by interleaving one local SGD update with one iteration of push-sum at each node. Both SGP and Overlap SGP converge to a stationary point of smooth nonconvex functions. Existing gossip-based approaches explored in the context of training DNNs are constrained to use symmetric communication (also known as push-pull). For instance, if node i sends to node j, then i must also receive from j before proceeding. This inherently requires deadlock avoidance and more synchronization, making these approaches slower and more sensitive to stragglers.
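The interleaving of a local gradient step with one push-sum iteration can be sketched on a toy problem. The example below is a simplified illustration, not our training code: node i holds a hypothetical quadratic loss 0.5 * (z - targets[i])**2, gradients are exact rather than stochastic minibatch gradients, the step size is constant, and the name `sgp_quadratic` and the uniform mixing weights are assumptions.

```python
def sgp_quadratic(targets, out_neighbors, lr=0.05, num_steps=1000):
    """Toy gradient-push loop on per-node losses 0.5 * (z - targets[i])**2."""
    n = len(targets)
    x = [0.0] * n  # push-sum numerators (the parameters that get mixed)
    w = [1.0] * n  # push-sum weights
    z = [0.0] * n  # de-biased parameters, z[i] = x[i] / w[i]
    for _ in range(num_steps):
        # 1) Local gradient step: the gradient is evaluated at z[i],
        #    but the update is applied to the numerator x[i].
        x = [x[i] - lr * (z[i] - targets[i]) for i in range(n)]
        # 2) One push-sum iteration: push shares of (x, w) to out-neighbors.
        new_x = [0.0] * n
        new_w = [0.0] * n
        for i in range(n):
            recipients = [i] + list(out_neighbors[i])
            share = 1.0 / len(recipients)
            for j in recipients:
                new_x[j] += share * x[i]
                new_w[j] += share * w[i]
        x, w = new_x, new_w
        # 3) De-bias to obtain the parameters used for the next gradient.
        z = [x[i] / w[i] for i in range(n)]
    return z


# Same hypothetical directed topology as a 4-node example.
graph = {0: [1, 2], 1: [2], 2: [3], 3: [0]}
print(sgp_quadratic([1.0, 2.0, 3.0, 10.0], graph))
```

In this toy setting the minimizer of the average loss is the mean of the targets (4.0). With a constant step size, the nodes end up close to that value but not exactly in agreement; the residual disagreement shrinks as the step size is reduced.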

Our approach, instead, uses directed messaging (push only). This enables the use of generic communication topologies that may be directed (asymmetric), sparse, and time-varying. To hide communication overhead, we can overlap gradient computation with communication. Nodes send messages to their out-neighbors after every update (nonblocking). They can also receive incoming messages at any time and incorporate them before the next update. Asynchronous algorithms generally run faster but may introduce additional errors that hamper performance, leading to a trade-off. We can explicitly control the degree of asynchrony in SGP: If a node hasn’t received messages from an in-neighbor after a certain number of iterations, then it waits to receive the messages before proceeding. We study the methods over several computing infrastructures and provide assessments on image classification and neural machine translation tasks.
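The rule for bounding asynchrony can be illustrated with a small bookkeeping sketch. This is one possible reading of the rule, not the released implementation: the helper name `stale_in_neighbors`, the peer labels, and the interpretation of "a certain number of iterations" as a `max_delay` threshold on the gap since the last received message are all assumptions.

```python
def stale_in_neighbors(current_step, last_received_step, max_delay):
    """Return the in-neighbors a node should block on before its next update.

    `last_received_step` maps each in-neighbor to the iteration at which its
    most recent message arrived. A neighbor is considered stale when more than
    `max_delay` iterations have passed since then.
    """
    return sorted(
        peer
        for peer, step in last_received_step.items()
        if current_step - step > max_delay
    )


# Hypothetical state at iteration 10 for a node with two in-neighbors.
last_seen = {"node_a": 9, "node_b": 7}
print(stale_in_neighbors(10, last_seen, max_delay=2))  # ['node_b']
print(stale_in_neighbors(10, last_seen, max_delay=3))  # []
```

Under this reading, a small `max_delay` keeps nodes closer to lockstep, while a larger one lets fast nodes run further ahead before they must wait.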

Why it matters:

Deep learning is largely an empirical field, where SGD and related first-order methods have been the workhorse for training neural networks. As researchers continue to explore increasingly large models, communication overhead will remain a bottleneck for distributed training. Our new method helps accelerate the training of DNN models using asynchronous distributed data-parallel algorithms.

Being able to train models faster will make it possible to retrain models more frequently and explore new models more rapidly. In turn, this increases the potential for building models with more accurate and relevant predictions. With this study and the public release of our code, we hope the AI community can build on this work and, ultimately, advance science faster.