October 16, 2020
An understanding of when and why adding momentum results in more efficient optimization of machine learning models. Optimization algorithms are key to modern machine learning. They govern how a model incorporates information during training and adapts to perform better during its next attempt. Of these optimization algorithms, the stochastic gradient descent method with momentum (SGD+M) is one of the most common across many subfields, including computer vision.
But despite its prevalence, SGD+M is still poorly understood theoretically. The practical benefits of momentum are undeniable, but up to this point the best-known theory bounding the speed of learning (or convergence) rated SGD+M worse than SGD sans momentum, which is at odds with real-world performance.
Our work establishes a tighter convergence rate bound, providing a fuller understanding of how momentum works and thus when SGD+M is the most efficient optimization algorithm.
Our analysis of SGD+M relies on a crucial idea: Momentum is better understood when implemented in stochastic primal averaging (SPA) form. The SPA equational form is equivalent in function to SGD+M but much easier to analyze mathematically.
Using the SPA form allows us to draw a tighter convergence rate bound than previous research into this problem. We’ve determined that under a reasonable assumption on the gradients, the convergence rate bound for our theory is almost precisely as tight as it is for SGD, with an extra factor that is negligible for any reasonable choice of momentum and step size. When no additional assumptions are made, the gap is reduced between the bounds for SGD and SGD+M to only a factor of square root 2 in the final convergence rate.
Because SPA is easier to analyze, we can derive a number of insights about momentum’s effects on deep learning problems that are otherwise unobtainable with SGD+M in its standard form. For instance, we’ve found that momentum accelerates convergence when the gradient buffer is positively correlated with the true gradient, thereby canceling the contribution of noise to the convergence rate bound. Empirically, it’s been demonstrated that this happens during the early iterations of training. Since these early iterations are also the most unstable, this explains why momentum accelerates learning more in early iterations and is subsequently less useful toward the end of the optimization process.
Analyzing SGD+M via the SPA form also provides insight into how researchers can most efficiently adjust learning rates during training. Traditionally, the learning rate parameter is decreased at regular intervals to allow the model to fine-tune itself to achieve the best possible performance. This common scheme actually no longer makes sense when using the SPA form. Instead, our theory supports a modified scheme, wherein best results are achieved by decreasing the momentum parameter instead of the learning rate parameter.
Momentum is a key component in modern deep learning training, and understanding it better is vital for improving optimization algorithms. Using the SPA form has allowed us to break down properties of momentum that previously we could only observe in practice. Our findings suggest that when adding momentum to more advanced optimization methods, the key is to use the kind of averaging found in the SPA form — meaning an averaging of the steps, not an averaging of the gradients. This insight allows us to apply momentum to virtually any existing optimization method to potentially improve its performance.
The development of better optimization algorithms directly results in faster training of AI systems, accelerating research and making it more broadly accessible.