The 2019 Conference on Neural Information Processing Systems (NeurIPS) is taking place in Vancouver, British Columbia, from Sunday, December 8, to Saturday, December 14. With over 15,000 attendees, NeurIPS is the largest conference in AI, with machine learning and neuroscience experts traveling from around the world to discuss the latest advances in the field. Facebook researchers and engineers in AI, core data science, networking and infrastructure, augmented and virtual reality, and more are presenting their research in poster sessions, spotlight talks, workshops, and tutorials at the conference. We're also launching the Deepfake Detection Challenge.
From Sunday morning through Wednesday evening, attendees can visit the Facebook exhibit booth to meet researchers, try some demos and tutorials, and speak with our recruitment team. On Tuesday afternoon, attendees can join AI Residents and Mentors for a Q&A on the AI Residency program and application process.
This year, Facebook’s contributions to the NeurIPS Expo are two workshops featuring PyTorch: Multi-modal Research to Production with PyTorch and Facebook, and Responsible and Reproducible AI with PyTorch and Facebook. In a continued effort to support diversity in AI, we are also supporting the Women in AI, Black in AI, and Latinx in AI workshops on Sunday and Monday.
Visit our NeurIPS 2019 event page for more details on demos and scheduled booth events, and to learn more about Facebook at NeurIPS 2019. A full day-by-day schedule of research being presented at NeurIPS, including activities such as workshops and tutorials, is available here.
Nicolas Carion, Nicolas Usunier, Gabriel Synnaeve, and Alessandro Lazaric
Effective coordination is crucial to solve multi-agent collaborative (MAC) problems. While centralized reinforcement learning methods can optimally solve small MAC instances, they do not scale to large problems and they fail to generalize to scenarios different from those seen during training. In this paper, we consider MAC problems with some intrinsic notion of locality (e.g., geographic proximity) such that interactions between agents and tasks are locally limited. By leveraging this property, we introduce a novel structured prediction approach to assign agents to tasks. At each step, the assignment is obtained by solving a centralized optimization problem (the inference procedure) whose objective function is parameterized by a learned scoring model. We propose different combinations of inference procedures and scoring models able to represent coordination patterns of increasing complexity. The resulting assignment policy can be efficiently learned on small problem instances and readily reused in problems with more agents and tasks (i.e., zero-shot generalization). We report experimental results on a toy search and rescue problem and on several target selection scenarios in StarCraft®: Brood War, in which our model significantly outperforms strong rule-based baselines on instances with five times more agents and tasks than those seen during training.
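As a rough illustration of the inference step described above, the sketch below picks the agent-to-task assignment that maximizes a sum of pairwise scores. The score table, the function name `best_assignment`, and the brute-force search are assumptions made for illustration; the paper's learned scoring models and inference procedures are not reproduced here.

```python
from itertools import permutations


def best_assignment(scores):
    """Return (assignment, total) maximizing the summed scores.

    scores[i][j] is a hypothetical score for giving task j to agent i.
    Assumes at least as many tasks as agents, with one task per agent.
    Brute force over permutations, so only suitable for tiny instances.
    """
    n_agents = len(scores)
    n_tasks = len(scores[0])
    best, best_total = None, float("-inf")
    for tasks in permutations(range(n_tasks), n_agents):
        total = sum(scores[i][t] for i, t in enumerate(tasks))
        if total > best_total:
            best, best_total = tasks, total
    return list(best), best_total


# Two agents, three tasks; the values are made up.
scores = [[0.9, 0.1, 0.3],
          [0.8, 0.7, 0.2]]
assignment, total = best_assignment(scores)
```

Because the objective is just a score table, the same search can be reused on larger instances without retraining, which is the intuition behind the zero-shot generalization claim.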
Rahma Chaabouni, Eugene Kharitonov, Emmanuel Dupoux, and Marco Baroni
Despite renewed interest in emergent language simulations with neural networks, little is known about the basic properties of the induced code, and how they compare to human language. One fundamental characteristic of the latter, known as Zipf’s Law of Abbreviation (ZLA), is that more frequent words are efficiently associated to shorter strings. We study whether the same pattern emerges when two neural networks, a “speaker” and a “listener,” are trained to play a signaling game. Surprisingly, we find that networks develop an anti-efficient encoding scheme, in which the most frequent inputs are associated to the longest messages, and messages in general are skewed toward the maximum length threshold. This anti-efficient code appears easier to discriminate for the listener, and, unlike in human communication, the speaker does not impose a contrasting least-effort pressure towards brevity. Indeed, when the cost function includes a penalty for longer messages, the resulting message distribution starts respecting ZLA. Our analysis stresses the importance of studying the basic features of emergent communication in a highly controlled setup, to ensure the latter will not depart too far from human language. Moreover, we present a concrete illustration of how different functional pressures can lead to successful communication codes that lack basic properties of human language, thus highlighting the role such pressures play in the latter.
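The two ideas in this abstract, a length penalty in the cost function and a check for Zipf's Law of Abbreviation, can be sketched in a few lines. The penalty weight `alpha`, the function names, and the toy vocabulary are assumptions for illustration, and the monotonicity check is a simplification rather than the paper's analysis.

```python
def penalized_loss(task_loss, message, alpha=0.1):
    """Task loss plus a hypothetical per-symbol penalty on message length."""
    return task_loss + alpha * len(message)


def respects_zla(freq, messages):
    """True if more frequent inputs never get longer messages than rarer ones."""
    ranked = sorted(freq, key=freq.get, reverse=True)
    lengths = [len(messages[x]) for x in ranked]
    return all(a <= b for a, b in zip(lengths, lengths[1:]))


freq = {"a": 50, "b": 10, "c": 1}
efficient = {"a": "x", "b": "xy", "c": "xyz"}        # frequent inputs are short
anti_efficient = {"a": "xyzz", "b": "xyz", "c": "xy"}  # frequent inputs are long
```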
Peter Anderson, Ayush Shrivastava, Devi Parikh, Dhruv Batra, and Stefan Lee
A visually grounded navigation instruction can be interpreted as a sequence of expected observations and actions an agent following the correct trajectory would encounter and perform. Based on this intuition, we formulate the problem of finding the goal location in Vision-and-Language Navigation (VLN) within the framework of Bayesian state tracking, learning observation and motion models conditioned on these expectable events. Together with a mapper that constructs a semantic spatial map on the fly during navigation, we formulate an end-to-end differentiable Bayes filter and train it to identify the goal by predicting the most likely trajectory through the map according to the instructions. The resulting navigation policy constitutes a new approach to instruction following that explicitly models a probability distribution over states, encoding strong geometric and algorithmic priors while enabling greater explainability. Our experiments show that our approach outperforms a strong LingUNet baseline when predicting the goal location on the map. On the full VLN task, i.e., navigating to the goal location, our approach achieves promising results with less reliance on navigation constraints.
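For readers unfamiliar with Bayes filters, here is a minimal discrete version of one predict-then-update step. In the paper the motion and observation models are learned and differentiable; the hand-written tables and the three-cell map below are assumptions used only to show the mechanics.

```python
def bayes_filter_step(belief, transition, likelihood):
    """One predict-then-update step over a discrete set of map cells.

    belief[i]        prior probability of being in cell i
    transition[i][j] P(next cell = j | current cell = i), the motion model
    likelihood[j]    P(observation | cell = j), the observation model
    """
    n = len(belief)
    # Predict: push the belief through the motion model.
    predicted = [sum(belief[i] * transition[i][j] for i in range(n))
                 for j in range(n)]
    # Update: reweight by the observation likelihood and renormalize.
    unnormalized = [predicted[j] * likelihood[j] for j in range(n)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]


# Toy map with three cells; the motion model shifts the agent one cell right.
transition = [[0, 1, 0],
              [0, 0, 1],
              [0, 0, 1]]
posterior = bayes_filter_step([0.5, 0.5, 0.0], transition, [0.1, 0.8, 0.2])
```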
Chhavi Yadav and Leon Bottou
Although the popular MNIST data set [LeCun et al., 1994] is derived from the NIST database [Grother and Hanaoka, 1995], the precise processing steps for this derivation have been lost to time. We propose a reconstruction that is accurate enough to serve as a replacement for the MNIST data set, with insignificant changes in accuracy. We trace each MNIST digit to its NIST source and its rich metadata such as writer identifier, partition identifier, etc. We also reconstruct the complete MNIST test set with 60,000 samples instead of the usual 10,000. Since the remaining 50,000 were never distributed, they can be used to investigate the impact of 25 years of MNIST experiments on the reported testing performances. Our limited results unambiguously confirm the trends observed by Recht et al. [2018, 2019]: Although the misclassification rates are slightly off, classifier ordering and model selection remain broadly reliable. We attribute this phenomenon to the pairing benefits of comparing classifiers on the same digits.
People can learn a new concept and use it compositionally, understanding how to “blicket twice” after learning how to “blicket.” In contrast, powerful sequence-to-sequence (seq2seq) neural networks fail such tests of compositionality, especially when composing new concepts together with existing concepts. In this paper, I show how memory-augmented neural networks can be trained to generalize compositionally through meta seq2seq learning. In this approach, models train on a series of seq2seq problems to acquire the compositional skills needed to solve new seq2seq problems. Meta seq2seq learning solves several of the SCAN tests for compositional learning and can learn to apply implicit rules to variables.
Natalia Neverova, David Novotny, and Andrea Vedaldi
Many machine learning methods depend on human supervision to achieve optimal performance. However, in tasks such as DensePose, where the goal is to establish dense visual correspondences between images, the quality of manual annotations is intrinsically limited. We address this issue by augmenting neural network predictors with the ability to output a distribution over labels, thus explicitly and introspectively capturing the aleatoric uncertainty in the annotations. Compared to previous works, we show that correlated error fields arise naturally in applications such as DensePose and these fields can be modelled by deep networks, leading to a better understanding of the annotation errors. We show that these models, by understanding uncertainty better, can solve the original DensePose task more accurately, thus setting the new state-of-the-art accuracy in this benchmark. Finally, we demonstrate the utility of the uncertainty estimates in fusing the predictions produced by multiple models, resulting in a better and more principled approach to model ensembling which can further improve accuracy.
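One simple way to see why uncertainty estimates help when fusing predictions is inverse-variance weighting. The sketch below assumes independent Gaussian errors with known per-model variances, which is a deliberate simplification: the paper models correlated error fields, and the function name `fuse` is hypothetical.

```python
def fuse(means, variances):
    """Combine scalar predictions, weighting each by 1 / variance.

    Returns the fused mean and the variance of the fused estimate,
    under the assumption of independent Gaussian errors.
    """
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    mean = sum(w * m for w, m in zip(weights, means)) / total
    return mean, 1.0 / total


# The second model is more confident, so the result is pulled toward it.
fused_mean, fused_var = fuse([0.0, 1.0], [1.0, 0.25])
```

With equal variances this reduces to a plain average, so ordinary ensembling is the special case where every model is treated as equally reliable.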
Jianwei Yang, Zhile Ren, Chuang Gan, Hongyuan Zhu, and Devi Parikh
Convolutional neural networks process input data by sending channel-wise feature response maps to subsequent layers. While a lot of progress has been made by making networks deeper, information from each channel can only be propagated from lower levels to higher levels in a hierarchical feed-forward manner. When viewing each filter in the convolutional layer as a neuron, those neurons are not communicating explicitly within each layer in CNNs. We introduce a novel network unit called the cross-channel communication (C3) block, a simple yet effective module to encourage neuron communication within the same layer. The C3 block enables neurons to exchange information through a micro neural network, which consists of a feature encoder, a message communicator, and a feature decoder, before sending the information to the next layer. With the C3 block, each neuron accounts for the channel-wise responses from other neurons at the same layer and learns more discriminative and complementary representations. Extensive experiments on multiple computer vision tasks show that our proposed mechanism allows shallower networks to aggregate useful information within each layer and to outperform baseline deep networks and other competitive methods.
Akshay Agrawal, Brandon Amos, Shane Barratt, Stephen Boyd, Steven Diamond, and J. Zico Kolter
Recent work has shown how to embed differentiable optimization problems (that is, problems whose solutions can be backpropagated through) as layers within deep learning architectures. This method provides a useful inductive bias for certain problems, but existing software for differentiable optimization layers is rigid and difficult to apply to new settings. In this paper, we propose an approach to differentiating through disciplined convex programs, a subclass of convex optimization problems used by domain-specific languages (DSLs) for convex optimization. We introduce disciplined parametrized programming, a subset of disciplined convex programming, and we show that every disciplined parametrized program can be represented as the composition of an affine map from parameters to problem data, a solver, and an affine map from the solver’s solution to a solution of the original problem (a new form we refer to as affine-solver-affine form). We then demonstrate how to efficiently differentiate through each of these components, allowing for end-to-end analytical differentiation through the entire convex program. We implement our methodology in version 1.1 of CVXPY, a popular Python-embedded DSL for convex optimization, and additionally implement differentiable layers for disciplined convex programs in PyTorch and TensorFlow 2.0. Our implementation significantly lowers the barrier to using convex optimization problems in differentiable programs. We present applications in linear machine learning models and in stochastic control, and we show that our layer is competitive (in execution time) compared to specialized differentiable solvers from past work.
Jian Qian, Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric
The exploration bonus is an effective approach to manage the exploration-exploitation trade-off in Markov decision processes (MDPs). While it has been analyzed in infinite-horizon discounted and finite-horizon problems, we focus on designing and analysing the exploration bonus in the more challenging infinite-horizon undiscounted setting. We first introduce SCAL+, a variant of SCAL, that uses a suitable exploration bonus to solve any discrete unknown weakly communicating MDP for which an upper bound c on the span of the optimal bias function is known. We prove that SCAL+ enjoys the same regret guarantees as SCAL, which relies on the less-efficient extended value iteration approach. Furthermore, we leverage the flexibility provided by the exploration bonus scheme to generalize SCAL+ to smooth MDPs with continuous state space and discrete actions. We show that the resulting algorithm (SCCAL+) achieves the same regret bound as UCCRL while being the first implementable algorithm for this setting.
Stéphane d’Ascoli, Levent Sagun, Giulio Biroli, and Joan Bruna
Despite the phenomenal success of deep neural networks in a broad range of learning tasks, there is a lack of theory to understand the way they work. In particular, convolutional neural networks (CNNs) are known to perform much better than fully connected networks (FCNs) on spatially structured data: The architectural structure of CNNs benefits from prior knowledge on the features of the data, for instance their translation invariance. The aim of this work is to understand this fact through the lens of dynamics in the loss landscape.
We introduce a method that maps a CNN to its equivalent FCN (denoted as eFCN). Such an embedding enables the comparison of CNN and FCN training dynamics directly in the FCN space. We use this method to test a new training protocol, which consists in training a CNN, embedding it to FCN space at a certain “relax time,” then resuming the training in FCN space. We observe that for all relax times, the deviation from the CNN subspace is small, and the final performance reached by the eFCN is higher than that reachable by a standard FCN of the same architecture. More surprisingly, for some intermediate relax times, the eFCN outperforms the CNN it stemmed from, by combining the prior information of the CNN and the expressivity of the FCN in a complementary way. The practical interest of our protocol is limited by the very large size of the highly sparse eFCN. However, it offers interesting insights into the persistence of architectural bias under stochastic gradient dynamics. It shows the existence of some rare basins in the FCN loss landscape associated with very good generalization. These can only be accessed thanks to the CNN prior, which helps navigate the landscape during the early stages of optimization.
Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Hervé Jegou
Data augmentation is key to the training of neural networks for image classification. This paper first shows that existing augmentations induce a significant discrepancy between the size of the objects seen by the classifier at train and test time: In fact, a lower train resolution improves the classification at test time!
We then propose a simple strategy to optimize the classifier performance that employs different train and test resolutions. It relies on a computationally cheap fine-tuning of the network at the test resolution. This enables training strong classifiers using small training images, and therefore significantly reduces the training time. For instance, we obtain 77.1% top-1 accuracy on ImageNet with a ResNet50 trained on 128×128 images, and 79.8% with one trained at 224×224.
A ResNeXt-101 32x48d pretrained with weak supervision on 940 million 224×224 images and further optimized with our technique for test resolution 320×320 achieves 86.4% top-1 accuracy (top-5: 98.0%). To the best of our knowledge this is the highest ImageNet single-crop accuracy to date.
Mahmoud Assran, Joshua Romoff, Nicolas Ballas, Joelle Pineau, and Mike Rabbat
Multi-simulator training has contributed to the recent success of deep reinforcement learning by stabilizing learning and allowing for higher training throughputs. We propose gossip-based actor-learner architectures (GALA) where several actor-learners (such as A2C agents) are organized in a peer-to-peer communication topology, and exchange information through asynchronous gossip in order to take advantage of a large number of distributed simulators. We prove that GALA agents remain within an ε-ball of one another during training when using loosely coupled asynchronous communication. By reducing the amount of synchronization between agents, GALA is more computationally efficient and scalable compared to A2C, its fully synchronous counterpart. GALA also outperforms A3C, being more robust and sample efficient. We show that we can run several loosely coupled GALA agents in parallel on a single GPU and achieve significantly higher hardware utilization and frame-rates than vanilla A2C at comparable power draws.
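The gossip idea can be illustrated with a toy averaging scheme in which each agent mixes its parameters with one neighbor on a ring. This sketch is synchronous and has no learning updates, whereas GALA uses asynchronous gossip between actor-learners; the ring topology, the `mix` weight, and the function names are assumptions for illustration.

```python
def gossip_round(params, mix=0.5):
    """Each agent i averages its parameter vector with ring neighbor i + 1."""
    n = len(params)
    return [
        [(1 - mix) * a + mix * b
         for a, b in zip(params[i], params[(i + 1) % n])]
        for i in range(n)
    ]


def spread(params):
    """Largest coordinate-wise gap between any two agents."""
    dims = range(len(params[0]))
    return max(max(p[d] for p in params) - min(p[d] for p in params)
               for d in dims)


# Four agents with one scalar parameter each.
params = [[0.0], [4.0], [8.0], [12.0]]
initial_spread = spread(params)
for _ in range(50):
    params = gossip_round(params)
final_spread = spread(params)
```

Each round only needs communication with a single neighbor, yet the agents drift toward a common value, which conveys the intuition behind keeping agents within a small ball of one another without full synchronization.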
Hengyuan Hu, Denis Yarats, Qucheng Gong, Yuandong Tian, and Mike Lewis
We explore using latent natural language instructions as an expressive and compositional representation of complex actions for hierarchical decision making. Rather than directly selecting micro-actions, our agent first generates a latent plan in natural language, which is then executed by a separate model. We introduce a challenging real-time strategy game environment in which the actions of a large number of units must be coordinated across long time scales. We gather a data set of 76,000 pairs of instructions and executions from human play, and train instructor and executor models. Experiments show that models using natural language as a latent variable significantly outperform models that directly imitate human actions. The compositional structure of language proves crucial to its effectiveness for action representation. We also release our code, models, and data.
Eliya Nachmani and Lior Wolf
Neural decoders were shown to outperform classical message passing techniques for short BCH codes. In this work, we extend these results to much larger families of algebraic block codes, by performing message passing with graph neural networks. The parameters of the subnetwork at each variable node in the Tanner graph are obtained from a hypernetwork that receives the absolute values of the current message as input. To add stability, we employ a simplified version of the arctanh activation that is based on a high-order Taylor approximation of this activation function. Our results show that for a large number of algebraic block codes, from diverse families of codes (BCH, LDPC, Polar), the decoding obtained with our method outperforms the vanilla belief propagation method as well as other learning techniques from the literature.
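To see why a truncated Taylor series can add stability, compare it with the exact function: arctanh diverges as its argument approaches 1, while the polynomial stays bounded. The sketch below uses the standard series x + x^3/3 + x^5/5 + ...; the truncation order is an assumption, since the abstract only says "high-order".

```python
import math


def arctanh_taylor(x, order=5):
    """Odd-power Taylor series of arctanh around 0, truncated at `order`."""
    return sum(x ** k / k for k in range(1, order + 1, 2))


# Close to the exact value for small inputs...
approx_small = arctanh_taylor(0.3, order=9)
exact_small = math.atanh(0.3)
# ...and still finite at x = 1.0, where math.atanh raises a domain error.
approx_edge = arctanh_taylor(1.0, order=5)
```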
Qi Liu, Maximilian Nickel, and Douwe Kiela
Learning from graph-structured data is an important task in machine learning and artificial intelligence, for which graph neural networks (GNNs) have shown great promise. Motivated by recent advances in geometric representation learning, we propose a novel GNN architecture for learning representations on Riemannian manifolds with differentiable exponential and logarithmic maps. We develop a scalable algorithm for modeling the structural properties of graphs, comparing Euclidean and hyperbolic geometry. In our experiments, we show that hyperbolic GNNs can lead to substantial improvements on various benchmark data sets.
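As a concrete example of exponential and logarithmic maps, the sketch below implements them at the origin of the Poincaré ball with curvature -1. This particular model and base point are assumptions chosen for brevity; the paper's architecture may use other manifolds or parameterizations.

```python
import math


def exp0(v):
    """Map a tangent vector at the origin onto the Poincaré ball."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(norm) / norm
    return [scale * x for x in v]


def log0(y):
    """Map a point of the Poincaré ball back to the tangent space at the origin."""
    norm = math.sqrt(sum(x * x for x in y))
    if norm == 0.0:
        return list(y)
    scale = math.atanh(norm) / norm
    return [scale * x for x in y]


v = [0.3, -0.4]
point = exp0(v)          # lies strictly inside the unit ball
recovered = log0(point)  # round trip returns the tangent vector
```

A layer can then apply ordinary Euclidean operations in the tangent space and map the result back to the manifold, and both maps are differentiable away from the boundary.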
Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, and Lorenzo Torresani
Modern approaches for multi-person pose estimation in video require large amounts of dense annotations. However, labeling every frame in a video is costly and labor intensive. To reduce the need for dense annotations, we propose a PoseWarper network that leverages training videos with sparse annotations (every k frames) to learn to perform dense temporal pose propagation and estimation. Given a pair of video frames — a labeled Frame A and an unlabeled Frame B — we train our model to predict human pose in Frame A using the features from Frame B by means of deformable convolutions to implicitly learn the pose warping between A and B. We demonstrate that we can leverage our trained PoseWarper for several applications. First, at inference time we can reverse the application direction of our network in order to propagate pose information from manually annotated frames to unlabeled frames. This makes it possible to generate pose annotations for the entire video given only a few manually labeled frames. Compared to modern label propagation methods based on optical flow, our warping mechanism is much more compact (6M vs. 39M parameters), and also more accurate (88.7% mAP vs. 83.8% mAP). We also show that we can improve the accuracy of a pose estimator by training it on an augmented data set obtained by adding our propagated poses to the original manual labels. Lastly, we can use our PoseWarper to aggregate temporal pose information from neighboring frames during inference. This allows us to obtain state-of-the-art pose detection results on PoseTrack2017 and PoseTrack2018 data sets.
Xinyun Chen and Yuandong Tian
Search-based methods for hard combinatorial optimization are often guided by heuristics. Tuning heuristics in various conditions and situations is often time-consuming. In this paper, we propose NeuRewriter, which learns a policy to pick heuristics and rewrite the local components of the current solution to iteratively improve it until convergence. The policy factorizes into a region-picking and a rule-picking component, each parameterized by a neural network trained with actor-critic methods in reinforcement learning. NeuRewriter captures the general structure of combinatorial problems and shows strong performance in three versatile tasks: expression simplification, online job scheduling, and vehicle routing problems. NeuRewriter outperforms the expression simplification component in Z3; outperforms DeepRM and Google OR-tools in online job scheduling; and outperforms recent neural baselines [35, 29] and Google OR-tools in vehicle routing problems.
Jiatao Gu, Changhan Wang, and Junbo Zhao
Modern neural sequence generation models are built to either generate tokens step-by-step from scratch or (iteratively) modify a sequence of tokens bounded by a fixed length. In this work, we develop Levenshtein Transformer, a new partially autoregressive model devised for more flexible and amenable sequence generation. Unlike previous approaches, the basic operations of our model are insertion and deletion. The combination of them facilitates not only generation but also sequence refinement allowing dynamic length changes. We also propose a set of new training techniques dedicated to them, effectively exploiting one as the other’s learning signal thanks to their complementary nature. Experiments applying the proposed model achieve comparable or even better performance with much-improved efficiency on both generation (e.g., machine translation, text summarization) and refinement tasks (e.g., automatic post-editing). We further confirm the flexibility of our model by showing a Levenshtein Transformer trained by machine translation can straightforwardly be used for automatic post-editing.
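To make the two edit operations concrete, the sketch below counts the minimum number of single-token insertions and deletions needed to turn a draft into a target, using a longest-common-subsequence table. It illustrates the edit operations only, not the model's learned policies, and the function name `indel_distance` is hypothetical.

```python
def indel_distance(source, target):
    """Minimum number of token insertions plus deletions mapping source to target."""
    m, n = len(source), len(target)
    # lcs[i][j] = length of the longest common subsequence
    # of source[:i] and target[:j]
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if source[i - 1] == target[j - 1]:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    kept = lcs[m][n]
    # Tokens outside the common subsequence are deleted from the source
    # or inserted from the target.
    return (m - kept) + (n - kept)


draft = "the cat cat sat".split()
target = "the black cat sat".split()
edits = indel_distance(draft, target)  # delete one "cat", insert "black"
```

Because the draft and the target can have different lengths, refinement tasks such as post-editing fit the same framing as generation from an empty sequence.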
Andrea Zanette, Alessandro Lazaric, Mykel J. Kochenderfer, and Emma Brunskill
We study linear approximate value iteration (LAVI) with a generative model. While linear models may accurately represent the optimal value function using a few parameters, several empirical and theoretical studies show the combination of least-squares projection with the Bellman operator may be expansive, thus leading LAVI to amplify errors over iterations and eventually diverge. We introduce an algorithm that approximates value functions by combining Q-values estimated at a set of anchor states. Our algorithm tries to balance the generalization and compactness of linear methods with the small amplification of errors typical of interpolation methods. We prove that if the features at any state can be represented as a convex combination of features at the anchor points, then errors are propagated linearly over iterations (instead of exponentially) and our method achieves a polynomial sample complexity bound in the horizon and the number of anchor points. These findings are confirmed in preliminary simulations in a number of simple problems where a traditional least-square LAVI method diverges.
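The key property here, that a convex combination cannot amplify errors, is easy to check numerically. In the sketch below the weights, the anchor values, and the function name are made up for illustration; it is not the paper's algorithm, only the interpolation step it relies on.

```python
def interpolate_value(weights, anchor_values):
    """Value at a state as a convex combination of values at anchor states."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * q for w, q in zip(weights, anchor_values))


weights = [0.25, 0.75]
exact = interpolate_value(weights, [2.0, 4.0])
noisy = interpolate_value(weights, [2.1, 3.8])  # anchors estimated with error
# The largest estimation error at any anchor bounds the error at the state.
anchor_error = max(abs(2.1 - 2.0), abs(3.8 - 4.0))
```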
In this work we propose a differential geometric motivation for Nesterov’s accelerated gradient method (AGM) for strongly convex problems. By considering the optimization procedure as occurring on a Riemannian manifold with a natural structure, the AGM method can be seen as the proximal point method applied in this curved space. This viewpoint can also be extended to the continuous time case, where the accelerated gradient method arises from the natural block-implicit Euler discretization of an ODE on the manifold. We provide an analysis of the convergence rate of this ODE for quadratic objectives.
Aaron Defazio and Leon Bottou
The application of stochastic variance reduction to optimization has shown remarkable recent theoretical and practical success. The applicability of these techniques to the hard nonconvex optimization problems encountered during training of modern deep neural networks is an open problem. We show that naive application of the SVRG technique and related approaches fails, and explore why.
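For context, this is what an SVRG update looks like on a convex toy problem where it works well: each stochastic gradient is corrected by the same sample's gradient at a snapshot plus the full gradient at that snapshot. The one-dimensional least-squares objective, the step size, and the fixed sample order are assumptions for illustration; the paper's point is that this behavior does not carry over naively to deep nonconvex networks.

```python
def svrg_epoch(w, data, lr=0.05):
    """One SVRG epoch on f(w) = mean_i 0.5 * (x_i * w - y_i) ** 2."""
    def grad_i(w, x, y):
        return (x * w - y) * x

    snapshot = w
    full_grad = sum(grad_i(snapshot, x, y) for x, y in data) / len(data)
    for x, y in data:  # fixed order keeps the example deterministic
        # Variance-reduced gradient estimate.
        g = grad_i(w, x, y) - grad_i(snapshot, x, y) + full_grad
        w -= lr * g
    return w


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # generated by y = 2 * x
w = 0.0
for _ in range(30):
    w = svrg_epoch(w, data)
```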
Ari Morcos, Haonan Yu, Michela Paganini, and Yuandong Tian
The success of lottery ticket initializations suggests that small, sparsified networks can be trained so long as the network is initialized appropriately. Unfortunately, finding these “winning ticket” initializations is computationally expensive. One potential solution is to reuse the same winning tickets across a variety of data sets and optimizers. However, the generality of winning ticket initializations remains unclear. Here, we attempt to answer this question by generating winning tickets for one training configuration (optimizer and data set) and evaluating their performance on another configuration. Perhaps surprisingly, we found that, within the natural images domain, winning ticket initializations generalized across a variety of data sets, including Fashion MNIST, SVHN, CIFAR-10/100, ImageNet, and Places365, often achieving performance close to that of winning tickets generated on the same data set. Moreover, winning tickets generated using larger data sets consistently transferred better than those generated using smaller data sets. We also found that winning ticket initializations generalize across optimizers with high performance. These results suggest that winning ticket initializations generated by sufficiently large data sets contain inductive biases generic to neural networks more broadly which improve training across many settings and provide hope for the development of better initialization methods.
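The abstract does not spell out how a winning ticket is produced. A common recipe in the lottery ticket literature is magnitude pruning followed by resetting the surviving weights to their initial values, and the sketch below assumes that recipe on a flat list of weights. The function name, the single pruning round, and the numbers are all illustrative.

```python
def winning_ticket(init_weights, trained_weights, keep_fraction):
    """Return (mask, ticket) for one round of magnitude pruning.

    The mask keeps the largest-magnitude trained weights; the ticket is the
    initial weights with every pruned position set to zero.
    """
    n = len(trained_weights)
    n_keep = max(1, round(keep_fraction * n))
    ranked = sorted(range(n), key=lambda i: abs(trained_weights[i]),
                    reverse=True)
    keep = set(ranked[:n_keep])
    mask = [1 if i in keep else 0 for i in range(n)]
    ticket = [w * m for w, m in zip(init_weights, mask)]
    return mask, ticket


init = [0.1, -0.2, 0.3, 0.4]
trained = [0.05, -0.9, 0.01, 0.6]
mask, ticket = winning_ticket(init, trained, keep_fraction=0.5)
```

Transferring a ticket then amounts to reusing `mask` and `ticket` as the starting point when training on a different data set or with a different optimizer.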
David Novotny, Benjamin Graham, and Jeremy Reizenstein
Given a set of reference RGBD views of an indoor environment, and a new viewpoint, our goal is to predict the view from that location. Prior work on new view generation has predominantly focused on significantly constrained scenarios, typically involving artificially rendered views of isolated CAD models. Here we tackle a much more challenging version of the problem. We devise an approach that exploits known geometric properties of the scene (per-frame camera extrinsics and depth) in order to warp reference views into the new ones. The defects in the generated views are handled by a novel RGBD inpainting network, PerspectiveNet, that is fine-tuned for a given scene in order to obtain images that are geometrically consistent with all the views in the scene camera system. Experiments conducted on the ScanNet and SceneNet data sets reveal performance superior to strong baselines.
Anton Bakhtin, Laurens van der Maaten, Justin Johnson, Laura Gustafson, and Ross Girshick
Understanding and reasoning about physics is an important ability of intelligent agents. We develop the PHYRE benchmark for physical reasoning that contains a set of simple classical mechanics puzzles in a 2D physical environment. The benchmark is designed to encourage the development of learning algorithms that are sample-efficient and generalize well across puzzles. We test several modern learning algorithms on PHYRE and find that these algorithms fall short in solving the puzzles efficiently. We expect that PHYRE will encourage the development of novel sample-efficient agents that learn efficient but useful models of physics. For code and to play PHYRE for yourself, please visit https://player.phyre.ai/.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala
Deep learning frameworks have often focused on either usability or speed, but not both. PyTorch is a machine learning library that shows that these two goals are, in fact, compatible: It provides an imperative and Pythonic programming style that supports code as a model, makes debugging easy, and is consistent with other popular scientific computing libraries, while remaining efficient and supporting hardware accelerators such as GPUs.
In this paper, we detail the principles that drove the implementation of PyTorch and how they are reflected in its architecture. We emphasize that every aspect of PyTorch is a regular Python program under the full control of its user. We also explain how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance. We demonstrate the efficiency of individual subsystems, as well as the overall speed of PyTorch on several common benchmarks.
Alexander Peysakhovich, Christian Kroer, and Adam Lerer
We consider the problem of using logged data to make predictions about what would happen if we changed the ‘rules of the game’ in a multi-agent system. This task is difficult because in many cases we observe actions individuals take but not their private information or their full reward functions. In addition, agents are strategic, so when the rules change, they will also change their actions. Existing methods (e.g., structural estimation, inverse reinforcement learning) assume that agents’ behavior comes from optimizing some utility or that the system is in equilibrium. They make counterfactual predictions by using observed actions to learn the underlying utility function (aka type) and then solving for the equilibrium of the counterfactual environment. This approach imposes heavy assumptions such as the rationality of the agents being observed and a correct model of the environment and agents’ utility functions. We propose a method for analyzing the sensitivity of counterfactual conclusions to violations of these assumptions, which we call robust multi-agent counterfactual prediction (RMAC). We provide a first-order method for computing RMAC bounds. We apply RMAC to classic environments in market design: auctions, school choice, and social choice.
Ronald Ortner, Matteo Pirotta, Alessandro Lazaric, Ronan Fruit, and Odalric-Ambrym Maillard
We consider the problem of online reinforcement learning when several state representations (mapping histories to a discrete state space) are available to the learning agent. At least one of these representations is assumed to induce a Markov decision process (MDP), and the performance of the agent is measured in terms of cumulative regret against the optimal policy giving the highest average reward in this MDP representation. We propose an algorithm (UCB-MS) with Õ(√T) regret in any communicating MDP. The regret bound shows that UCB-MS automatically adapts to the Markov model and improves over the currently known best bound of order Õ(T^{2/3}).
Remi Cadene, Corentin Dancette, Hedi Ben younes, Matthieu Cord, and Devi Parikh
Visual question answering (VQA) is the task of answering questions about an image. VQA models often exploit unimodal biases to provide the correct answer without using the image information. As a result, they suffer from a huge drop in performance when evaluated on data outside their training set distribution. This critical issue makes them unsuitable for real-world settings. We propose RUBi, a new learning strategy to reduce biases in any VQA model. It reduces the importance of the most biased examples, i.e., examples that can be correctly classified without looking at the image. It implicitly forces the VQA model to use the two input modalities instead of relying on statistical regularities between the question and the answer. We leverage a question-only model that captures the language biases by identifying when these unwanted regularities are used. It prevents the base VQA model from learning them by influencing its predictions. This leads to dynamically adjusting the loss in order to compensate for biases. We validate our contributions by surpassing the current state-of-the-art results on VQA-CP v2. This data set is specifically designed to assess the robustness of VQA models when exposed to question biases at test time that differ from those seen during training.
Our code is available here.
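To make the loss-adjustment idea concrete, here is a minimal NumPy sketch. It assumes the question-only model's logits are squashed by a sigmoid into a mask that multiplies the base model's logits before the cross-entropy is computed. That masking form, and the names `rubi_style_loss`, `base_logits`, and `question_only_logits`, are illustrative assumptions and may differ from the released code.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(logits, target):
    # Negative log-likelihood of the target class.
    return -np.log(softmax(logits)[target])

def rubi_style_loss(base_logits, question_only_logits, target):
    # Assumption: the question-only branch yields a sigmoid mask in (0, 1)
    # that rescales the base model's logits before the loss is computed.
    mask = 1.0 / (1.0 + np.exp(-question_only_logits))
    return cross_entropy(base_logits * mask, target)

# Hypothetical 3-answer example; the correct answer is index 0.
base = np.array([2.0, 1.0, 0.0])
plain = cross_entropy(base, 0)
# Question alone already points at the right answer (a biased example).
biased = rubi_style_loss(base, np.array([5.0, -5.0, -5.0]), 0)
# Question alone points at a wrong answer.
counter = rubi_style_loss(base, np.array([-5.0, 5.0, -5.0]), 0)
```

In this toy setup, the loss shrinks when the question alone already gives away the answer and grows when the language bias points the wrong way, which is the dynamic adjustment the abstract describes.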
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman
In the last year, new models and methods for pretraining and transfer learning have driven striking performance improvements across a range of language understanding tasks. The GLUE benchmark, introduced a little over one year ago, offers a single-number metric that summarizes progress on a diverse set of such tasks, but performance on the benchmark has recently surpassed the level of nonexpert humans, suggesting limited headroom for further research. In this paper we present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard. SuperGLUE is available at super.gluebenchmark.com.
Pratyusha Sharma, Deepak Pathak, and Abhinav Gupta
We study a generalized setup for learning from demonstration to build an agent that can manipulate novel objects in unseen scenarios by looking at only a single video of human demonstration from a third-person perspective. To accomplish this goal, our agent should not only learn to understand the intent of the demonstrated third-person video in its context but also perform the intended task in its environment configuration. Our central insight is to enforce this structure explicitly during learning by decoupling what to achieve (intended task) from how to perform it (controller). We propose a hierarchical setup where a high-level module learns to generate a series of first-person subgoals conditioned on the third-person video demonstration, and a low-level controller predicts the actions to achieve those subgoals. Our agent acts from raw image observations without any access to the full state information. We show results on a real robotic platform using Baxter for the manipulation tasks of pouring and placing objects in a box. Project video and code are at https://pathak22.github.io/hierarchical-imitation/.
Yonathan Efroni, Nadav Merlis, Mohammad Ghavamzadeh, and Shie Mannor
State-of-the-art efficient model-based reinforcement learning (RL) algorithms typically act by iteratively solving empirical models, i.e., by performing full planning on Markov decision processes (MDPs) built from the gathered experience. In this paper, we focus on model-based RL in the finite-state finite-horizon undiscounted MDP setting and establish that exploring with greedy policies, i.e., acting by one-step planning, can achieve tight minimax performance in terms of regret, Õ(√HSAT). Thus, full planning in model-based RL can be avoided altogether without any performance degradation, and, by doing so, the computational complexity decreases by a factor of S. The results are based on a novel analysis of real-time dynamic programming, then extended to model-based RL. Specifically, we generalize existing algorithms that perform full planning to act by one-step planning. For these generalizations, we prove regret bounds with the same rate as their full-planning counterparts.
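As a rough illustration of acting by one-step planning, the sketch below runs real-time dynamic programming on a known finite-horizon MDP. It is a simplification under stated assumptions: the transition tensor `P` and reward matrix `R` are given rather than estimated, there are no exploration bonuses, and the function name `one_step_greedy_episode` is hypothetical. Only the value of the currently visited state is updated at each step, so each step costs O(SA) instead of a full backward pass over all S states.

```python
import numpy as np

def one_step_greedy_episode(P, R, V, s0, rng):
    """Run one episode, planning one step ahead at the visited state only.

    P: (S, A, S) transition probabilities (assumed known in this sketch).
    R: (S, A) rewards in [0, 1].
    V: (H + 1, S) optimistic value table with V[H] == 0; updated in place.
    """
    H = V.shape[0] - 1
    s, total = s0, 0.0
    for h in range(H):
        # One-step lookahead for the current state: an (A,)-vector of Q-values.
        q = R[s] + P[s] @ V[h + 1]
        a = int(np.argmax(q))
        # Tighten the optimistic value of the visited state only.
        V[h, s] = min(V[h, s], q[a])
        total += R[s, a]
        s = rng.choice(P.shape[2], p=P[s, a])
    return total

# Hypothetical 2-state, 2-action deterministic MDP with horizon 3.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = P[1, 0, 1] = P[1, 1, 0] = 1.0
R = np.array([[0.0, 0.0], [1.0, 0.0]])
H = 3
V = np.array([[float(H - h)] * 2 for h in range(H + 1)])  # optimistic init
rng = np.random.default_rng(0)
returns = [one_step_greedy_episode(P, R, V, 0, rng) for _ in range(5)]
```

Starting from an optimistic table (here `V[h, s] = H - h`) keeps every entry an upper bound on the true optimal value, which is what lets the greedy choice still drive exploration in this toy example.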
Mickaël Chen, Thierry Artieres, and Ludovic Denoyer
Object segmentation is a crucial problem that is usually solved by using supervised learning approaches over very large data sets composed of both images and corresponding object masks. Since the masks have to be provided at pixel level, building such a data set for any new domain can be very costly. We present ReDO, a new model able to extract objects from images without any annotation in an unsupervised way. It relies on the idea that it should be possible to change the textures or colors of the objects without changing the overall distribution of the data set. Following this assumption, our approach is based on an adversarial architecture where the generator is guided by an input sample: Given an image, it extracts the object mask, then redraws a new object at the same location. The generator is controlled by a discriminator that ensures that the distribution of generated images is aligned to the original one. We experiment with this method on different data sets and demonstrate the good quality of extracted masks.
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions data set and then transfer it to multiple established vision-and-language tasks — visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval — by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models — achieving state of the art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and toward treating visual grounding as a pretrainable and transferable capability.
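The two-stream interaction can be pictured with a small NumPy sketch of co-attention, in which each stream forms queries from its own features and attends over the other stream's features. This is only an illustration under assumptions: it uses a single head and omits the learned projections, residual connections, and feed-forward sublayers of a real transformer block, and the names `attention` and `co_attention` are hypothetical.

```python
import numpy as np

def attention(queries, keys, values):
    # Scaled dot-product attention; each row of weights sums to 1.
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

def co_attention(visual, text):
    """visual: (Nv, d) region features; text: (Nt, d) token features."""
    # The visual stream queries the text stream, and vice versa.
    visual_out, _ = attention(visual, text, text)
    text_out, _ = attention(text, visual, visual)
    return visual_out, text_out

# Hypothetical sizes: 4 image regions, 6 tokens, feature width 8.
rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 8))
text = rng.normal(size=(6, 8))
visual_out, text_out = co_attention(visual, text)
```

Each output keeps the shape of its own stream, so the two streams stay separate while every region representation becomes a mixture of token features and every token representation a mixture of region features.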
Paper: A closer look at the optimization landscape of GANs
Hugo Bérard, Gauthier Gidel, Amjad Almahairi, Pascal Vincent, and Simon Lacoste-Julien
Co-organizer: Kyunghyun Cho
Co-chair: Alborz Geramifard
Invited speaker: Y-Lan Boureau
Paper: Improving robustness of task-oriented dialog systems
Arash Einolghozati, Sonal Gupta, Mrinal Mohit, and Rushin Shah
Co-organizer: Joelle Pineau
Paper: Benchmarking batch deep reinforcement learning algorithms
Scott Fujimoto, Edoardo Conti, Mohammad Ghavamzadeh, and Joelle Pineau
Paper: Data-efficient co-adaptation of morphology and behaviour with deep reinforcement learning
Kevin Sebastian Luck, Heni Ben Amor, and Roberto Calandra
Paper: Modular visual navigation using active neural mapping
Devendra Singh Chaplot, Saurabh Gupta, Abhinav Gupta, and Ruslan Salakhutdinov
Paper: Objective mismatch in model-based reinforcement learning
Nathan Lambert, Brandon Amos, Omry Yadan, and Roberto Calandra
Paper: Plan2Vec: Unsupervised representation learning by latent plans
Ge Yang, Amy Zhang, Ari Morcos, Joelle Pineau, Pieter Abbeel, and Roberto Calandra
Paper: Search in cooperative partially observable games
Adam Lerer, Hengyuan Hu, Jakob Foerster, and Noam Brown
Paper: SEERL: Sample efficient ensemble reinforcement learning
Rohan Saphal, Balaraman Ravindran, Dheevatsa Mudigere, Sasikanth Avancha, and Bharat Kaul
Grand Keynote: Yann LeCun
Paper: Energy-aware neural architecture optimization with splitting steepest descent
Dilin Wang, Lemeng Wu, Meng Li, Vikas Chandra, and Qiang Liu
Paper: Improving efficiency in neural network accelerator using operands hamming distance optimization
Meng Li, Yilei Li, Pierce Chuang, Liangzhen Lai, and Vikas Chandra
Co-organizers: Kyunghyun Cho, Douwe Kiela, and Cinjon Resnick
Co-organizer: Michela Paganini
Speaker: Eytan Bakshy
Co-organizer: Michela Paganini
Co-organizer: Roberto Calandra
Invited speaker: Brenden Lake
Co-organizer: Aparna Lakshmiratan
Paper: Mvfst-rl: An asynchronous RL framework for congestion control with delayed actions
Viswanath Sivakumar, Tim Rocktäschel, Alexander H. Miller, Heinrich Küttler, Nantas Nardelli, Mike Rabbat, Joelle Pineau, and Sebastian Riedel
Paper: Post-training 4-bit quantization on embedding tables
Hui Guan, Andrey Malevich, Jiyan Yang, Jongsoo Park, and Hector Yuen
Paper: Predictive precompute with recurrent neural networks
Hanson Wang, Zehui Wang, and Yuanyuan Ma
Speakers: Narine Kokhlikyan, William Falcon, Shubho Sengupta, Joe Spisak, and Ailing Zhang
Speakers: Raghuraman Krishnamoorthi, Xian Li, Dmytro Okhonko, Vinicius Reis, Michael Suo, Yongqiang Wang, and Yuxin Wu
Co-organizers: Jessica Forde, Michela Paganini, Joelle Pineau, Koustuv Sinha, and Shagun Sodhani
Co-organizer: Roberto Calandra
Co-organizer: Mohammad Ghavamzadeh
Paper: Improved algorithms for conservative exploration in bandits
Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, and Matteo Pirotta
Paper: Robust identifiability in linear structural equation models for causal inference
Karthik Abinav Sankararaman, Anand Louis, and Navin Goyal
Paper: Thompson sampling for contextual bandit problems with auxiliary safety constraints
Samuel Daulton, Shaun Singh, Vashist Avadhanula, Drew Dimmery, and Eytan Bakshy
Co-organizers: Adriana Romero and Levent Sagun
Speakers: Kyunghyun Cho and Natalia Neverova
Panelists: Nafissa Yakubova and Aparna Lakshmiratan
Panel advisor: Michela Paganini
Paper: Non-Gaussian processes and neural networks at finite widths
Paper: The generalization-stability tradeoff in neural network pruning
Brian R. Bartoldson, Ari Morcos, Adrian Barbu, and Gordon Erlebacher
Paper: Training batchnorm and only batchnorm
Jonathan Frankle, David Schwab, and Ari Morcos
Co-organizer: Michela Paganini
Paper: CraftAssist: A framework for dialogue-enabled interactive agents
Kavya Srinet, Jonathan Gray, Yacine Jernite, Haonan Yu, Zhuoyuan Chen, Demi Guo, Siddharth Goyal, Larry Zitnick, and Arthur Szlam