Interpolation consistency training for semi-supervised learning
We introduce Interpolation Consistency Training (ICT), a simple and computationally efficient algorithm for training deep neural networks in the semi-supervised learning paradigm.
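A minimal sketch of the consistency term (illustrative, not the authors' exact training loop; the `student`/`teacher` models and the Beta-distributed mixing coefficient are assumptions in the spirit of the paper):

```python
import torch
import torch.nn.functional as F

def ict_consistency_loss(student, teacher, x_u1, x_u2, alpha=1.0):
    """Interpolation consistency on unlabeled data: the student's
    prediction at an interpolated input should match the interpolation
    of the (fixed) teacher's predictions at the two endpoints."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    x_mix = lam * x_u1 + (1 - lam) * x_u2       # interpolate inputs
    with torch.no_grad():                       # teacher provides targets
        p_mix = (lam * teacher(x_u1).softmax(-1)
                 + (1 - lam) * teacher(x_u2).softmax(-1))
    return F.mse_loss(student(x_mix).softmax(-1), p_mix)
```

This term is added to the usual supervised loss on the labeled batch.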
Learning about an exponential amount of conditional distributions
We introduce the Neural Conditioner (NC), a self-supervised machine able to learn about all the conditional distributions of a random vector X.
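As a sketch of the interface (the masking scheme below is our illustration of the available/requested idea, not the paper's exact architecture):

```python
import torch

def nc_inputs(x, p=0.5):
    """Build Neural Conditioner-style inputs: sample an 'available' mask a
    and a disjoint 'requested' mask r; the model sees only x * a together
    with (a, r) and must generate the coordinates selected by r."""
    a = (torch.rand_like(x) < p).float()               # observed coordinates
    r = (1.0 - a) * (torch.rand_like(x) < p).float()   # coordinates to generate
    return torch.cat([x * a, a, r], dim=-1), r
```

Because masks are sampled afresh at each step, a single network is exposed to exponentially many conditioning patterns.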
Gradient descent happens in a tiny subspace
We show that, in a variety of large-scale deep learning scenarios, the gradient dynamically converges to a very small subspace after a short period of training.
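One simple way to probe such a claim empirically (our illustration; the paper relates the subspace to the top Hessian eigenvectors, which we do not compute here):

```python
import numpy as np

def gradient_subspace_fraction(grads, k=10):
    """Fraction of each gradient's squared norm captured by the top-k
    right singular vectors of a window of recent gradients.
    grads: (T, d) array of flattened gradients from consecutive steps."""
    G = np.asarray(grads)
    _, _, Vt = np.linalg.svd(G, full_matrices=False)
    P = Vt[:k]                  # top-k principal directions
    proj = G @ P.T              # gradient coordinates in that subspace
    return (proj ** 2).sum(axis=1) / (G ** 2).sum(axis=1)
```

Values close to 1 for small k are the signature of the effect.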
On the curved geometry of accelerated optimization
When the optimization procedure is viewed as occurring on a Riemannian manifold with a natural structure, the Nesterov accelerated gradient method can be seen as the proximal point method applied in this curved space.
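For reference, the two updates being related, written in standard Euclidean notation (the paper's contribution is the curved-space view that connects them):

```latex
\text{proximal point:}\quad
x_{k+1} = \arg\min_x \Big\{ f(x) + \tfrac{1}{2\lambda}\,\|x - x_k\|^2 \Big\},
\qquad
\text{Nesterov:}\quad
\begin{aligned}
y_k &= x_k + \tfrac{k-1}{k+2}\,(x_k - x_{k-1}),\\
x_{k+1} &= y_k - \eta\,\nabla f(y_k).
\end{aligned}
```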
Controlling covariate shift using equilibrium normalization of weights
We introduce a new normalization technique that exhibits the fast convergence properties of batch normalization using a transformation of layer weights instead of layer outputs.
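We do not reproduce the paper's exact transform here; as a generic illustration of normalizing weights rather than activations (this is weight standardization, a related but distinct technique):

```python
import torch

def standardize_weights(W, eps=1e-5):
    """Normalize each output unit's fan-in weights to zero mean and unit
    scale, so downstream activation statistics are controlled without
    transforming the layer outputs themselves. W: (out_features, fan_in)."""
    mean = W.mean(dim=1, keepdim=True)
    std = W.std(dim=1, keepdim=True)
    return (W - mean) / (std + eps)
```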
On the ineffectiveness of variance reduced optimization for deep learning
We show that naive application of the SVRG technique and related variance-reduction approaches fails for deep learning, and we explore why.
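For concreteness, the "naive application" in question is the textbook SVRG update, sketched here (`grad_i` and `full_grad` are assumed oracles for per-example and full-dataset gradients):

```python
import numpy as np

def svrg(w, grad_i, full_grad, n, lr=0.1, epochs=10):
    """Textbook SVRG: variance-reduced SGD using a periodically
    refreshed snapshot and its full gradient as a control variate."""
    for _ in range(epochs):
        w_snap = w.copy()
        mu = full_grad(w_snap)        # full gradient at the snapshot
        for _ in range(n):
            i = np.random.randint(n)
            # stochastic gradient, corrected by its value at the snapshot
            w = w - lr * (grad_i(w, i) - grad_i(w_snap, i) + mu)
    return w
```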
Stochastic gradient push for distributed deep learning
This paper studies Stochastic Gradient Push (SGP), which combines PushSum with stochastic gradient updates.
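A single synchronous round of SGP might look like this (our vectorized illustration; `P` is the column-stochastic mixing matrix of the communication graph):

```python
import numpy as np

def sgp_step(X, w, grads, P, lr=0.1):
    """One Stochastic Gradient Push round over n nodes.
    X: (n, d) push-sum numerators; w: (n,) push-sum weights;
    grads: (n, d) local stochastic gradients evaluated at Z;
    P: (n, n) column-stochastic matrix, P[i, j] = weight node j sends to i."""
    X = P @ (X - lr * grads)     # local SGD step, then push-sum mixing
    w = P @ w                    # weights are mixed identically
    Z = X / w[:, None]           # de-biased parameter estimate at each node
    return X, w, Z
```

The ratio Z corrects the bias that directed (non-doubly-stochastic) communication would otherwise introduce.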
Frequentist uncertainty estimates for deep learning
We provide frequentist estimates of aleatoric and epistemic uncertainty for deep neural networks. To estimate aleatoric uncertainty we propose simultaneous quantile regression, a loss function to learn all the conditional quantiles of a given target variable.
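A minimal sketch of simultaneous quantile regression (assuming a model with signature `model(x, tau)` that takes the quantile level as an extra input):

```python
import torch

def pinball_loss(y_pred, y, tau):
    """Pinball (quantile) loss at level tau, elementwise then averaged."""
    diff = y - y_pred
    return torch.mean(torch.max(tau * diff, (tau - 1.0) * diff))

def sqr_loss(model, x, y):
    """Draw a random quantile level per example and train one network
    f(x, tau) under the pinball loss, so it learns all conditional
    quantiles at once; their spread estimates aleatoric uncertainty."""
    tau = torch.rand(y.shape[0], 1)
    return pinball_loss(model(x, tau), y, tau)
```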
Fluctuation-dissipation relations for stochastic gradient descent
Here, we derive stationary fluctuation-dissipation relations that link measurable quantities and hyperparameters in the stochastic gradient descent algorithm.
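The flavor of these relations is visible in the simplest momentum-free case: expanding the SGD update θ_{t+1} = θ_t − η g_t and requiring ⟨‖θ‖²⟩ to be stationary gives

```latex
\|\theta_{t+1}\|^2 = \|\theta_t\|^2 - 2\eta\,\theta_t \cdot g_t + \eta^2\,\|g_t\|^2
\;\;\Longrightarrow\;\;
\big\langle \theta \cdot g \big\rangle = \frac{\eta}{2}\,\big\langle \|g\|^2 \big\rangle .
```

Both sides are cheaply measurable during training, so the relation can be checked online, e.g. to detect stationarity and trigger learning-rate drops.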
Manifold mixup: better representations by interpolating hidden states
Deep neural networks excel at learning the training data, but often make confident yet incorrect predictions when evaluated on slightly different test examples, including distribution shifts, outliers, and adversarial examples. To address these issues, we propose Manifold Mixup, a simple regularizer that encourages neural networks to predict less confidently on interpolations of hidden representations.
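A minimal sketch of the regularizer (illustrative; `layers` stands for the network's modules, and labels are assumed one-hot):

```python
import numpy as np
import torch

def manifold_mixup_forward(layers, x, y_onehot, alpha=2.0):
    """Pick a random layer, mix the hidden states of a shuffled pair of
    examples there (and their labels identically), then continue the
    forward pass on the mixture."""
    k = np.random.randint(len(layers))          # layer at which to mix
    lam = float(np.random.beta(alpha, alpha))
    h = x
    for j, layer in enumerate(layers):
        if j == k:
            idx = torch.randperm(h.size(0))
            h = lam * h + (1 - lam) * h[idx]    # interpolate hidden states
            y_onehot = lam * y_onehot + (1 - lam) * y_onehot[idx]
        h = layer(h)
    return h, y_onehot
```

Training then minimizes the usual loss between the mixed output and the mixed labels.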
AdaGrad stepsizes: sharp convergence over nonconvex landscapes, from any initialization
We show that the norm version of AdaGrad (AdaGrad-Norm) converges to a stationary point at the O(log(N)/sqrt(N)) rate in the stochastic setting, and at the optimal O(1/N) rate in the batch (non-stochastic) setting -- in this sense, our convergence guarantees are 'sharp'.
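AdaGrad-Norm replaces AdaGrad's per-coordinate accumulators with a single scalar, roughly:

```python
import numpy as np

def adagrad_norm(w, grad, steps, lr=1.0, b0=0.01):
    """AdaGrad-Norm sketch: accumulate squared gradient *norms* into one
    scalar b^2 and use lr / b as a global stepsize; the paper's point is
    that convergence holds however b0 is initialized."""
    b2 = b0 ** 2
    for _ in range(steps):
        g = grad(w)
        b2 += np.dot(g, g)              # scalar accumulator of ||g||^2
        w = w - lr * g / np.sqrt(b2)    # one adaptive stepsize for all coords
    return w
```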
WNGrad: learn the learning rate in gradient descent
Inspired by batch normalization, we propose a general nonlinear update rule for the learning rate in batch and stochastic gradient descent so that the learning rate can be initialized at a high value, and is subsequently decreased according to gradient observations along the way.
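A sketch of the WNGrad-style update as we read it (treat the details as an approximation of the paper's rule, not a faithful reproduction):

```python
import numpy as np

def wngrad(w, grad, steps, b0=0.1):
    """The inverse learning rate b grows by ||g||^2 / b each step, so the
    learning rate 1/b can start large and decreases only as fast as the
    observed gradients warrant."""
    b = b0
    for _ in range(steps):
        g = grad(w)
        b = b + np.dot(g, g) / b    # nonlinear update of the inverse stepsize
        w = w - g / b
    return w
```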
Adversarial vulnerability of neural networks increases with input dimension
We show that adversarial vulnerability increases with the norm of the gradients of the training objective, viewed as a function of the inputs.
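The link between input gradients and vulnerability is the usual first-order one: for an ℓ∞-bounded perturbation, the worst-case linearized change in the loss is governed by the ℓ1 norm of the input gradient,

```latex
L(x + \delta) - L(x) \approx \delta \cdot \nabla_x L(x),
\qquad
\max_{\|\delta\|_\infty \le \epsilon} \; \delta \cdot \nabla_x L(x) = \epsilon\,\|\nabla_x L(x)\|_1 ,
```

and this norm grows with the input dimension for standard architectures and initializations.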
Geometrical insights for implicit generative modeling
Learning algorithms for implicit generative models can optimize a variety of criteria that measure how the data distribution differs from the implicit model distribution, including the Wasserstein distance, the Energy distance, and the Maximum Mean Discrepancy criterion.
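For instance, the Energy distance between data and model samples has a simple plug-in estimate (a biased but serviceable sketch):

```python
import numpy as np

def energy_distance(X, Y):
    """Plug-in estimate of the Energy distance between sample sets
    X: (n, d) and Y: (m, d):  2 E||X - Y|| - E||X - X'|| - E||Y - Y'||."""
    def mean_dist(A, B):
        return np.mean(np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1))
    return 2 * mean_dist(X, Y) - mean_dist(X, X) - mean_dist(Y, Y)
```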
Mixup: beyond empirical risk minimization
Our experiments on the ImageNet-2012, CIFAR-10, CIFAR-100, Google commands, and UCI datasets show that mixup, which trains on convex combinations of pairs of examples and their labels, improves the generalization of state-of-the-art neural network architectures.
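The recipe itself fits in a few lines (sketch, assuming one-hot labels):

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.2):
    """mixup: train on convex combinations of random pairs of examples
    and the same convex combinations of their one-hot labels."""
    lam = float(np.random.beta(alpha, alpha))
    idx = np.random.permutation(len(x))
    x_mix = lam * x + (1 - lam) * x[idx]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[idx]
    return x_mix, y_mix
```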