Machine learning experts from around the world are gathering this week in Long Beach for the 2019 International Conference on Machine Learning (ICML). Research from Facebook will be presented in oral spotlight presentations and group poster sessions.
Our researchers and engineers will also be participating in other activities throughout the week, including several workshops ranging from Generative Modeling and Model-Based Reasoning for Robotics and AI to Self-Supervised Learning. As part of our commitment to diversify the field, Facebook AI is also co-sponsoring two additional events at ICML: the Women in Machine Learning (WiML) dinner and the Latinx in AI Workshop.
For those attending ICML, be sure to visit the Facebook Research booth to learn more about what we’re working on.
We introduce a new beam search decoder that is fully differentiable, making it possible to optimize at training time through the inference procedure. Our decoder allows us to combine models which operate at different granularities (e.g., acoustic and language models). It can be used when target sequences are not aligned to input sequences by considering all possible alignments between the two. We demonstrate our approach scales by applying it to speech recognition, jointly training acoustic and word-level language models. The system is end-to-end, with gradients flowing through the whole architecture from the word-level transcriptions. Recent research efforts have shown that deep neural networks with attention-based mechanisms can successfully train an acoustic model from the final transcription, while implicitly learning a language model. Instead, we show that it is possible to discriminatively train an acoustic model jointly with an explicit and possibly pretrained language model.
Rachel Ward, XiaoXia Wu, Leon Bottou
Adaptive gradient methods such as AdaGrad and its variants update the stepsize in stochastic gradient descent on the fly according to the gradients received along the way; such methods have gained widespread use in large-scale optimization for their ability to converge robustly, without the need to fine-tune parameters such as the stepsize schedule. Yet, the theoretical guarantees to date for AdaGrad are for online and convex optimization. We bridge this gap by providing strong theoretical guarantees for the convergence of AdaGrad over smooth, nonconvex landscapes. We show that the norm version of AdaGrad (AdaGrad-Norm) converges to a stationary point at the Õ(log(N)/ √ N) rate in the stochastic setting, and at the optimal Õ(1/N) rate in the batch (non-stochastic) setting – in this sense, our convergence guarantees are “sharp.” In particular, both our theoretical results and extensive numerical experiments imply that AdaGrad-Norm is robust to the unknown Lipschitz constant and level of stochastic noise on the gradient.
Noam Brown, Adam Lerer, Sam Gross, Tuomas Sandholm
Counterfactual Regret Minimization (CFR) is the leading framework for solving large imperfect-information games. It converges to an equilibrium by iteratively traversing the game tree. In order to deal with extremely large games, abstraction is typically applied before running CFR. The abstracted game is solved with tabular CFR, and its solution is mapped back to the full game. This process can be problematic because aspects of abstraction are often manual and domain specific, abstraction algorithms may miss important strategic nuances of the game, and there is a chicken-and-egg problem because determining a good abstraction requires knowledge of the equilibrium of the game. This paper introduces Deep Counterfactual Regret Minimization, a form of CFR that obviates the need for abstraction by instead using deep neural networks to approximate the behavior of CFR in the full game. We show that Deep CFR is principled and achieves strong performance in large poker games. This is the first non-tabular variant of CFR to be successful in large games.
Arjun Seshadri, Alex Peysakhovich, Johan Ugander
Many applications in preference learning assume that decisions come from the maximization of a stable utility function. Yet a large experimental literature shows that individual choices and judgements can be affected by “irrelevant” aspects of the context in which they are made. An important class of such contexts is the composition of the choice set. In this work, our goal is to discover such choice set effects from raw choice data. We introduce an extension of the Multinomial Logit (MNL) model, called the context dependent random utility model (CDM), which allows for a particular class of choice set effects. We show that the CDM can be thought of as a second-order approximation to a general choice system, can be inferred optimally using maximum likelihood and, importantly, is easily interpretable. We apply the CDM to both real and simulated choice data to perform principled exploratory analyses for the presence of choice set effects.
The AlphaGo, AlphaGo Zero, and AlphaZero series of algorithms are remarkable demonstrations of deep reinforcement learning’s capabilities, achieving superhuman performance in the complex game of Go with progressively increasing autonomy. However, many obstacles remain in the understanding of and usability of these promising approaches by the research community. Toward elucidating unresolved mysteries and facilitating future research, we propose ELF OpenGo, an open-source reimplementation of the AlphaZero algorithm. ELF OpenGo is the first open-source Go AI to convincingly demonstrate superhuman performance with a perfect (20:0) record against global top professionals. We apply ELF OpenGo to conduct extensive ablation studies, and to identify and analyze numerous interesting phenomena in both the model training and in the gameplay inference procedures. Our code, models, selfplay data sets, and auxiliary data are publicly available.
Over the past few years, neural networks were proven vulnerable to adversarial images: Targeted but imperceptible image perturbations lead to drastically different predictions. We show that adversarial vulnerability increases with the gradients of the training objective when viewed as a function of the inputs. Surprisingly, vulnerability does not depend on network topology: For many standard network architectures, we prove that at initialization, the l1-norm of these gradients grows as the square root of the input dimension, leaving the networks increasingly vulnerable with growing image size. We empirically show that this dimension-dependence persists after either usual or robust training, but gets attenuated with higher regularization.
Branislav Kveton, Csaba Szepesvari, Sharan Vaswani, Zheng Wen, Mohammad Ghavamzadeh, Tor Lattimore
We propose a bandit algorithm that explores by randomizing its history of rewards. Specifically, it pulls the arm with the highest mean reward in a non-parametric bootstrap sample of its history with pseudo rewards. We design the pseudo rewards such that the bootstrap mean is optimistic with a sufficiently high probability. We call our algorithm Giro, which stands for “garbage in, reward out.” We analyze Giro in a Bernoulli bandit and derive a O(K∆−1 log n) bound on its n-round regret, where ∆ is the difference in the expected rewards of the optimal and the best suboptimal arms, and K is the number of arms. The main advantage of our exploration design is that it easily generalizes to structured problems. To show this, we propose contextual Giro with an arbitrary reward generalization model. We evaluate Giro and its contextual variant on multiple synthetic and real-world problems, and observe that it performs well.
Mohamed Elfeki, Camille Couprie, Morgane Rivière, Mohamed Elhoseiny
Generative models have proven to be an outstanding tool for representing high-dimensional probability distributions and generating realistic looking images. A fundamental characteristic of generative models is their ability to produce multi-modal outputs. However, while training, they are often susceptible to mode collapse, that is models are limited in mapping the input noise to only a few modes of the true data distribution. In this paper, we draw inspiration from Determinantal Point Process (DPP) to realize a generative model that alleviates mode collapse while producing higher quality samples. DPP is an elegant probabilistic measure used to model negative correlations within a subset and hence quantify its diversity. We use DPP kernel to model the diversity in real data as well as in synthetic data. Then, we devise a generation penalty term that encourages the generator to synthesize data with a similar diversity to real data. In contrast to previous state-of-the-art generative models that tend to use additional trainable parameters or complex training paradigms, our method does not change the original training scheme. Embedded in an adversarial training and variational autoencoder, our Generative DPP approach shows a consistent resistance to mode-collapse on a wide-variety of synthetic data and natural image data sets, while outperforming state-of-the-art methods for data-efficiency, convergence-time, and generation quality. Our code will be made publicly available.
Edward J. Smith, Scott Fujimoto, Adriana Romero, Dave Meger
Mesh models are a promising approach for encoding the structure of 3D objects. Current mesh reconstruction systems predict uniformly distributed vertex locations of a predetermined graph through a series of graph convolutions, leading to compromises with respect to performance or resolution. In this paper, we argue that the graph representation of geometric objects allows for additional structure, which should be leveraged for enhanced reconstruction. Thus, we propose a system which properly benefits from the advantages of the geometric structure of graph-encoded objects by introducing (1) a graph convolutional update preserving vertex information; (2) an adaptive splitting heuristic allowing detail to emerge; and (3) a training objective operating both on the local surfaces defined by vertices as well as the global structure defined by the mesh. Our proposed method is evaluated on the task of 3D object reconstruction from images with the ShapeNet data set, where we demonstrate state-of-the-art performance, both visually and numerically, while having far smaller space requirements by generating adaptive meshes.
Corentin Tallec, Léonard Blier, Yann Ollivier
Despite remarkable successes, Deep Reinforcement Learning (DRL) is not robust to hyperparameterization, implementation details, or small environment changes (Henderson et al. 2017, Zhang et al. 2018). Overcoming such sensitivity is key to making DRL applicable to real world problems. In this paper, we identify sensitivity to time discretization in near continuous-time environments as a critical factor; this covers, e.g., changing the number of frames per second, or the action frequency of the controller. Empirically, we find that Q-learning-based approaches such as Deep Q-learning (Mnih et al., 2015) and Deep Deterministic Policy Gradient (Lillicrap et al., 2015) collapse with small time steps. Formally, we prove that Q-learning does not exist in continuous time. We detail a principled way to build an off-policy RL algorithm that yields similar performances over a wide range of time discretizations, and confirm this robustness empirically.
Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi Sharif, Ioannis Mitliagkas, David Lopez-Paz, Yoshua Bengio
Deep neural networks excel at learning the training data, but often provide incorrect and confident predictions when evaluated on slightly different test examples. This includes distribution shifts, outliers, and adversarial examples. To address these issues, we propose Manifold Mixup, a simple regularizer that encourages neural networks to predict less confidently on interpolations of hidden representations. Manifold Mixup leverages semantic interpolations as additional training signal, obtaining neural networks with smoother decision boundaries at multiple levels of representation. As a result, neural networks trained with Manifold Mixup learn flatter class-representations, that is, with fewer directions of variance. We prove theory on why this flattening happens under ideal conditions, validate it empirically on practical situations, and connect it to the previous works on information theory and generalization. In spite of incurring no significant computation and being implemented in a few lines of code, Manifold Mixup improves strong baselines in supervised learning, robustness to single-step adversarial attacks, and test log-likelihood.
Mixture models trained via EM are among the simplest, most widely used and well-understood latent variable models in the machine learning literature. Surprisingly, these models have been hardly explored in text generation applications such as machine translation. In principle, they provide a latent variable to control generation and produce a diverse set of hypotheses. In practice, however, mixture models are prone to degeneracies — often only one component gets trained or the latent variable is simply ignored. We find that disabling dropout noise in responsibility computation is critical to successful training. In addition, the design choices of parameterization, prior distribution, hard versus soft EM and online versus offline assignment can dramatically affect model performance. We develop an evaluation protocol to assess both quality and diversity of generations against multiple references, and provide an extensive empirical study of several mixture model variants. Our analysis shows that certain types of mixture models are more robust and offer the best trade-off between translation quality and diversity compared to variational models and diverse decoding approaches.
Gourab Kundu, Prahal Arora, Ferdi Adeputra, Polina Kuznetsova, Daniel McKinnon, Michelle Cheung, Larry Anazia, Geoffrey Zweig
Identifying the temporal segments in a video that contain content relevant to a category or task is a difficult but interesting problem. This has applications in fine-grained video indexing and retrieval. Part of the difficulty in this problem comes from the lack of supervision since large-scale annotation of localized segments containing the content of interest is very expensive. In this paper, we propose to use the category assigned to an entire video as weak supervision to our model. Using such weak supervision, our model learns to do joint video level categorization and localization of content relevant to the category of the video. This can be thought of as providing both a classification label and an explanation in the form of the relevant regions of the video. Extensive experiments on a large scale data set show our model can achieve good localization performance without any direct supervision and can combine signals from multiple modalities like speech and vision.
Sean Welleck, Kianté Brantley, Hal Daumé III, Kyunghyun Cho
Standard sequential generation methods assume a prespecified generation order, such as text generation methods which generate words from left to right. In this work, we propose a framework for training models of text generation that operate in non-monotonic orders; the model directly learns good orders, without any additional annotation. Our framework operates by generating a word at an arbitrary position, and then recursively generating words to its left and then words to its right, yielding a binary tree. Learning is framed as imitation learning, including a coaching method which moves from imitating an oracle to reinforcing the policy’s own preferences. Experimental results demonstrate that using the proposed method, it is possible to learn policies which generate text without prespecifying a generation order, while achieving competitive performance with conventional left-to-right generation.
We propose a new class of probabilistic neural-symbolic models, that have symbolic functional programs as a latent, stochastic variable. Instantiated in the context of visual question answering, our probabilistic formulation offers two key conceptual advantages over prior neural-symbolic models for VQA. Firstly, the programs generated by our model are more understandable while requiring less number of teaching examples. Secondly, we show that one can pose counterfactual scenarios to the model, to probe its beliefs on the programs that could lead to a specified answer given an image. Our results on the CLEVR and SHAPES data sets verify our hypotheses, showing that the model gets better program (and answer) prediction accuracy even in the low data regime, and allows one to probe the coherence and consistency of reasoning performed.
Deepak Pathak, Dhiraj Gandhi, Abhinav Gupta
Exploration has been a long-standing problem in both model-based and model-free learning methods for sensorimotor control. There have been major advances in recent years demonstrated in noise-free, non-stochastic domains such as video games and simulation. However, most of the current formulations get stuck when there are stochastic dynamics. In this paper, we propose a formulation for exploration inspired from the work in active learning literature. Specifically, we train an ensemble of dynamics models and incentivize the agent to maximize the disagreement or variance of those ensembles. We show that this formulation works as well as other formulations in non-stochastic scenarios, and is able to explore better in scenarios with stochastic-dynamics. Further, we show that this objective can be leveraged to perform differentiable policy optimization. This leads to a sample efficient exploration policy. We show experiments on a large number of standard environments to demonstrate the efficacy of this approach. Furthermore, we implement our exploration algorithm on a real robot which learns to interact with objects completely from scratch.
In many finite horizon episodic reinforcement learning (RL) settings, it is desirable to optimize for the undiscounted return – in settings like Atari, for instance, the goal is to collect the most points while staying alive in the long run. Yet, it may be difficult (or even intractable) mathematically to learn with this target. As such, temporal discounting is often applied to optimize over a shorter effective planning horizon. This comes at the cost of potentially biasing the optimization target away from the undiscounted goal. In settings where this bias is unacceptable – where the system must optimize for longer horizons at higher discounts – the target of the value function approximator may increase in variance leading to difficulties in learning. We present an extension of temporal difference (TD) learning, which we call TD(∆), that breaks down a value function into a series of components based on the differences between value functions with smaller discount factors. The separation of a longer horizon value function into these components has useful properties in scalability and performance. We discuss these properties and show theoretic and empirical improvements over standard TD learning in certain settings.
Mahmoud Assran, Nicolas Loizou, Nicolas Ballas, Mike Rabbat
Distributed data-parallel algorithms aim to accelerate the training of deep neural networks by parallelizing the computation of large mini-batch gradient updates across multiple nodes. Approaches that synchronize nodes using exact distributed averaging (e.g., via ALLREDUCE) are sensitive to stragglers and communication delays. The PUSHSUM gossip algorithm is robust to these issues, but only performs approximate distributed averaging. This paper studies Stochastic Gradient Push (SGP), which combines PUSHSUM with stochastic gradient updates. We prove that SGP converges to a stationary point of smooth, non-convex objectives at the same sub-linear rate as SGD, and that all nodes achieve consensus. We empirically validate the performance of SGP on image classification (ResNet-50, ImageNet) and machine translation (Transformer, WMT’16 EnDe) workloads. Our code will be made publicly available.
We propose a targeted communication architecture for multi-agent reinforcement learning, where agents learn both what messages to send and to whom to address them while performing cooperative tasks in partially observable environments. This targeting behavior is learnt solely from downstream task-specific reward without any communication supervision. We additionally augment this with a multi-round communication approach where agents coordinate via multiple rounds of communication before taking actions in the environment. We evaluate our approach on a diverse set of cooperative multi-agent tasks, of varying difficulties, with varying number of agents, in a variety of environments ranging from 2D grid layouts of shapes and simulated traffic junctions to 3D indoor environments, and demonstrate the benefits of targeted and multi-round communication. Moreover, we show that the targeted communication strategies learned by agents are interpretable and intuitive. Finally, we show that our architecture can be easily extended to mixed and competitive environments, leading to improved performance and sample complexity over recent state-of-the-art approaches.
Ashwin Vijayakumar, Peter Anderson, Stefan Lee, Dhruv Batra
Many sequence prediction tasks admit multiple correct outputs and so, it is often useful to decode a set of outputs that maximize some task-specific set-level metric. However, retooling standard sequence prediction procedures tailored toward predicting single best outputs tends to produce sets containing very similar sequences; failing to capture the variation in the output space. To address this, we propose ∇BS, a trainable decoding procedure that outputs a set of sequences, highly valued according to the metric. Our method tightly integrates the training and decoding phases and further allows for the optimization of the task-specific metric addressing the shortcomings of standard sequence prediction. Further, we discuss the trade-offs of commonly used set-level metrics and motivate a new set-level metric that naturally evaluates the notion of “capturing the variation in the output space.” Finally, we show results on the image captioning task and find that our model outperforms standard techniques and natural ablations.
Xavier Bouthillier, César Laurent, Pascal Vincent
The apparent contradiction in the title is a wordplay on the different meanings attributed to the word reproducible across different scientific fields. What we imply is that unreproducible findings can be built upon reproducible methods. Without denying the importance of facilitating the reproduction of methods, we deem important to reassert that reproduction of findings is a fundamental step of the scientific inquiry. We argue that the commendable quest towards easy deterministic reproducibility of methods and numerical results should not have us forget the even more important necessity of ensuring the reproducibility of empirical conclusions and findings by properly accounting for essential sources of variations. We provide experiments to exemplify the brittleness of current common practice in the evaluation of models in the field of deep learning, showing that even if the results could be reproduced, a slightly different experiment would not support the findings. We hope to help clarify the distinction between exploratory and empirical research in the field of deep learning and believe more energy should be devoted to proper empirical research in our community. This work is an attempt to promote the use of more rigorous and diversified methodologies. It is not an attempt to impose a new methodology and it is not a critique on the nature of exploratory research.
Membership inference determines, given a sample and trained parameters of a machine learning model, whether the sample was part of the training set. In this paper, we derive the optimal strategy for membership inference with a few assumptions on the distribution of the parameters. We show that optimal attacks only depend on the loss function, and thus black-box attacks are as good as white-box attacks. As the optimal strategy is not tractable, we provide approximations of it leading to several inference methods, and show that existing membership inference methods are coarser approximations of this optimal strategy. Our membership attacks outperform the state of the art in various settings, ranging from a simple logistic regression to more complex architectures and data sets, such as ResNet-101 and Imagenet.
Organizers: Ari Morcos, Samy Bengio, Behnam Neyshabur, Ludwig Schmidt, Maithra Raghu, Hanie Sedghi, Kenji Hata, Ying Xiao, Ali Rahimi
Organizers: Sarath Chandar, Chelsea Finn, Abhishek Gupta, Khimya Khetarpal, Andrei Rusu, Shagun Sodhani, Amy Zhang
Organizers: Alborz Geramifard, Lihong Li, Yuxi Li, Csaba Szepesvari, Tao Wang, Pieter Abbeel, Craig Boutilier, Emma Brunkskill, John Langford, David Silver, David Sontag
Speakers: Yann LeCun, Chelsea Finn, Andrew Zisserman, Alexei Efros, Jacob Devlin, Abhinav Gupta