Research Area


Artificial intelligence has enjoyed immense practical success in recent years, largely due to advances in machine learning, especially deep learning via optimization. A rich mathematical theory explaining these empirical results can help drive further advances, and is in turn refined by feedback from them.

The latest results connect with celebrated techniques in learning theory, optimization, signal processing, and statistics. The interplay between rigorous theory and engineering advances pushes forward the frontiers of AI.

Latest Publications


Interpolation consistency training for semi-supervised learning

We introduce Interpolation Consistency Training (ICT), a simple and computationally efficient algorithm for training deep neural networks in the semi-supervised learning paradigm.
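The core consistency idea can be sketched in a few lines: the model's prediction at an interpolation of two unlabeled inputs should match the same interpolation of the predictions at those inputs. The function names below are illustrative, and the mean-squared penalty is one common choice rather than necessarily the paper's exact loss.

```python
import numpy as np

def ict_consistency_loss(student_pred_mix, teacher_pred1, teacher_pred2, lam):
    """ICT sketch: penalize the student's prediction at an interpolated
    unlabeled input for deviating from the matching interpolation of the
    teacher's predictions at the two original inputs."""
    target = lam * teacher_pred1 + (1.0 - lam) * teacher_pred2
    return float(np.mean((student_pred_mix - target) ** 2))
```

When the student already satisfies the interpolation constraint, the penalty vanishes; otherwise it pushes the decision function toward low-confidence regions between unlabeled points.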


Learning about an exponential amount of conditional distributions

We introduce the Neural Conditioner (NC), a self-supervised machine able to learn about all the conditional distributions of a random vector X.


Gradient descent happens in a tiny subspace

We show that in a variety of large-scale deep learning scenarios the gradient dynamically converges to a very small subspace after a short period of training.


On the curved geometry of accelerated optimization

By viewing the optimization procedure as occurring on a Riemannian manifold with a natural structure, the Nesterov accelerated gradient method can be seen as the proximal point method applied in this curved space.


Controlling covariate shift using equilibrium normalization of weights

We introduce a new normalization technique that exhibits the fast convergence properties of batch normalization using a transformation of layer weights instead of layer outputs.


On the ineffectiveness of variance reduced optimization for deep learning

We show that naive application of the SVRG technique and related approaches fails in deep learning, and we explore why.
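For reference, the variance-reduced update under study is easy to state. Below is a toy finite-sum instance with illustrative names, a sketch of the textbook SVRG loop rather than the paper's experimental setup: each epoch recomputes the full gradient at a snapshot, then corrects each stochastic gradient with a control variate.

```python
import numpy as np

def svrg(grads, x0, eta=0.5, epochs=10, rng=None):
    """Toy SVRG: `grads` is a list of per-example gradient functions for a
    finite-sum objective (an assumption of this sketch, not the paper)."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = len(grads)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(epochs):
        snapshot = x.copy()
        full_grad = sum(g(snapshot) for g in grads) / n  # full pass at snapshot
        for _ in range(n):
            i = rng.integers(n)
            # control variate: stochastic gradient, de-biased via the snapshot
            v = grads[i](x) - grads[i](snapshot) + full_grad
            x -= eta * v
    return x
```

The correction term keeps the update unbiased while shrinking its variance near the snapshot, which is exactly the property the paper finds hard to exploit at deep-learning scale.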


Stochastic gradient push for distributed deep learning

This paper studies Stochastic Gradient Push (SGP), which combines PushSum with stochastic gradient updates.
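A toy version of the combined update can be sketched as follows. This is our schematic reading of the method, not the paper's reference implementation: each node keeps a biased value z and a push-sum weight w, takes a local gradient step, then gossips both with a column-stochastic mixing matrix; the ratio x = z / w de-biases the average.

```python
import numpy as np

def sgp_round(z, w, grads, P, eta):
    """One round of a toy Stochastic Gradient Push over n nodes.
    z: (n, d) biased values, w: (n,) push-sum weights,
    P: (n, n) column-stochastic mixing matrix (illustrative setup)."""
    x = z / w[:, None]                                     # de-biased parameters
    g = np.stack([grads[i](x[i]) for i in range(len(grads))])
    z_half = z - eta * g                                   # local SGD step
    return P @ z_half, P @ w                               # push-sum mixing
```

Because P need only be column-stochastic, the gossip step works over directed, sparse communication topologies, which is the setting PushSum was designed for.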


Frequentist uncertainty estimates for deep learning

We provide frequentist estimates of aleatoric and epistemic uncertainty for deep neural networks. To estimate aleatoric uncertainty we propose simultaneous quantile regression, a loss function to learn all the conditional quantiles of a given target variable.
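The per-quantile building block is the standard pinball loss, sketched below for a single level tau; simultaneous quantile regression trains one network across all tau at once, which this fragment does not reproduce.

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Pinball (quantile) loss at level tau in (0, 1). Over a constant
    prediction it is minimized by the tau-quantile of y_true."""
    diff = np.asarray(y_true) - y_pred
    return float(np.mean(np.maximum(tau * diff, (tau - 1.0) * diff)))
```

Learning all conditional quantiles this way yields prediction intervals, and hence aleatoric uncertainty estimates, without distributional assumptions.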


Fluctuation-dissipation relations for stochastic gradient descent

Here, we derive stationary fluctuation-dissipation relations that link measurable quantities and hyperparameters in the stochastic gradient descent algorithm.


Manifold mixup: better representations by interpolating hidden states

Deep neural networks excel at learning the training data, but often provide incorrect and confident predictions when evaluated on slightly different test examples. This includes distribution shifts, outliers, and adversarial examples. To address these issues, we propose Manifold Mixup, a simple regularizer that encourages neural networks to predict less confidently on interpolations of hidden representations.
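The regularizer itself is a small operation applied at a randomly chosen hidden layer; the sketch below shows the mixing step, with `perm` pairing each minibatch example with another and all names and shapes illustrative.

```python
import numpy as np

def manifold_mixup(h, y, lam, perm):
    """Manifold Mixup sketch: interpolate hidden representations h and
    one-hot labels y across a permutation of the minibatch, then continue
    the forward pass from the mixed hidden states."""
    h_mix = lam * h + (1.0 - lam) * h[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]
    return h_mix, y_mix
```

Training on these interpolated hidden states flattens class representations, so the network assigns lower confidence to points between classes.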


AdaGrad stepsizes: sharp convergence over nonconvex landscapes, from any initialization

We show that the norm version of AdaGrad (AdaGrad-Norm) converges to a stationary point at the O(log(N)/sqrt(N)) rate in the stochastic setting, and at the optimal O(1/N) rate in the batch (non-stochastic) setting -- in this sense, our convergence guarantees are 'sharp'.
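The AdaGrad-Norm update analyzed here is a one-line variant of AdaGrad: a single scalar step size shared by all coordinates, divided by the running norm of past gradients. The sketch below uses illustrative defaults for the constants.

```python
import numpy as np

def adagrad_norm(grad, x0, eta=1.0, b0=1e-2, steps=200):
    """AdaGrad-Norm sketch: accumulate squared gradient norms into a single
    scalar b^2 and step with eta / b, regardless of initialization."""
    x = np.asarray(x0, dtype=float).copy()
    b2 = b0 ** 2
    for _ in range(steps):
        g = grad(x)
        b2 += float(np.dot(g, g))       # running sum of squared gradient norms
        x -= (eta / np.sqrt(b2)) * g    # one adaptive scalar step size
    return x
```

Because b grows with the observed gradients, the step size self-tunes: a too-large eta is damped after a few steps, which is what makes the guarantees hold from any initialization.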


WNGrad: learn the learning rate in gradient descent

Inspired by batch normalization, we propose a general nonlinear update rule for the learning rate in batch and stochastic gradient descent so that the learning rate can be initialized at a high value, and is subsequently decreased according to gradient observations along the way.
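One such reciprocal update can be sketched as follows. This is our reading of the WNGrad-style rule, with all constants illustrative: the learning rate 1/b starts high because b is initialized small, and b grows only as fast as the observed gradients demand.

```python
import numpy as np

def wngrad(grad, x0, b0=0.1, steps=300):
    """WNGrad-style sketch (constants are assumptions, not the paper's):
    grow b in proportion to observed squared gradients, step with rate 1/b."""
    x = np.asarray(x0, dtype=float).copy()
    b = b0
    for _ in range(steps):
        g = grad(x)
        b = b + float(np.dot(g, g)) / b   # b grows faster when gradients are large
        x -= g / b                        # learning rate 1/b decays adaptively
    return x
```

A small b0 produces one aggressive early adjustment of b, after which the rate settles; this is the sense in which the learning rate can safely be "initialized at a high value."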


Adversarial vulnerability of neural networks increases with input dimension

We show that adversarial vulnerability increases with the gradients of the training objective when viewed as a function of the inputs.


Geometrical insights for implicit generative modeling

Learning algorithms for implicit generative models can optimize a variety of criteria that measure how the data distribution differs from the implicit model distribution, including the Wasserstein distance, the Energy distance, and the Maximum Mean Discrepancy criterion.
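Of the criteria mentioned, the Maximum Mean Discrepancy is the simplest to write down from samples. Below is a biased estimate of squared MMD with an RBF kernel; the bandwidth sigma is a free choice of this sketch.

```python
import numpy as np

def mmd2_rbf(X, Y, sigma=1.0):
    """Biased estimate of squared Maximum Mean Discrepancy between samples
    X (n, d) and Y (m, d) under an RBF kernel with bandwidth sigma."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()
```

The estimate is zero when the two samples coincide and grows as the model distribution drifts from the data, which is what makes it usable as a training criterion for implicit models.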


Mixup: beyond empirical risk minimization

Our experiments on the ImageNet-2012, CIFAR-10, CIFAR-100, Google commands and UCI datasets show that mixup improves the generalization of state-of-the-art neural network architectures.
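The mixup operation itself fits in a few lines: train on convex combinations of pairs of examples and their one-hot labels, with the mixing coefficient drawn from a Beta(alpha, alpha) distribution. Function and argument names below are illustrative.

```python
import numpy as np

def mixup_pair(x1, y1, x2, y2, alpha=0.2, rng=None):
    """mixup sketch: return a convex combination of two examples and their
    one-hot labels, with lam ~ Beta(alpha, alpha)."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y, lam
```

Small alpha concentrates lam near 0 or 1, so most mixed examples stay close to a real one; larger alpha interpolates more aggressively.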
