July 10, 2020

Machine learning experts from around the world are gathering virtually for the 2020 International Conference on Machine Learning (ICML) to present the latest advances in machine learning. Research from Facebook will be presented in pre-recorded videos with live Q&A sessions.

Facebook researchers will also be speaking at several virtual workshops. For example, Arthur Szlam is speaking at the first Workshop on Language in Reinforcement Learning. Tim Rocktäschel will be speaking at Learning in Artificial Open Worlds, a workshop focused on machine learning in real-world settings. As part of our commitment to diversify the field, Facebook AI is also co-sponsoring the Women in Machine Learning Un-Workshop (where discussions are primarily driven by the participants) and Francisco Guzman is speaking at the Latinx in AI Workshop.

For those attending ICML, be sure to follow our Twitter channel to stay up to date. For more information on Facebook AI’s presence at ICML, check out our website.

__Aligned cross entropy for non-autoregressive machine translation __**Marjan Ghazvininejad, Vladimir Karpukhin, Luke Zettlemoyer, Omer Levy**

Non-autoregressive machine translation models significantly speed up decoding by allowing for parallel prediction of the entire target sequence. However, modeling word order is more challenging due to the lack of autoregressive factors in the model. This difficulty is compounded during training with cross entropy loss, which can highly penalize small shifts in word order. In this paper, we propose aligned cross entropy (AXE) as an alternate loss function for training of non-autoregressive models. AXE uses a differentiable dynamic program to assign loss based on the best possible monotonic alignment between target tokens and model predictions. AXE-based non-monotonic training of conditional masked language models (CMLMs) improves performance by 3 and 5 BLEU points respectively on WMT 16 EN-RO and WMT 14 EN-DE. It also significantly outperforms the state-of-the-art non-autoregressive models on a range of translation benchmarks.

__A sequential self teaching approach for improving generalization in sound event recognition __**Anurag Kumar**,** Vamsi Krishna Ithapu**

An important problem in machine auditory perception is to recognize and detect sound events. In this paper, we propose a sequential self-teaching approach to learning sounds. Our main proposition is that it is harder to learn sounds in adverse situations such as from weakly labeled and/or noisy labeled data, and in these situations a single stage of learning is not sufficient. Our proposal is a sequential stage-wise learning process that improves generalization capabilities of a given modeling system. We justify this method via technical results. On Audioset, the largest sound events dataset, our sequential learning approach can lead to up to 9% improvement in performance. A comprehensive evaluation also shows that the method leads to improved transferability of knowledge from previously trained models, thereby leading to improved generalization capabilities on transfer learning tasks.

__Certified data removal from machine learning models__

Chuan Guo, Thomas Goldstein, **Awni Hannun**, **Laurens van der Maaten**

Good data stewardship requires removal of data at the request of the data’s owner. This raises the question of whether and how a trained machine learning model, which implicitly stores information about its training data, should be affected by such a removal request. Is it possible to “remove” data from a machine learning model? We study this problem by defining certified removal: a very strong theoretical guarantee that a model from which data is removed cannot be distinguished from a model that never observed the data to begin with. We develop a certified-removal mechanism for linear classifiers and empirically study learning settings in which this mechanism is practical.

__Constrained Markov decision processes via reverse value functions __

Harsh Satija, Philip Amortila, **Joelle Pineau**

Although reinforcement learning (RL) algorithms have found tremendous success in simulated domains, they often cannot directly be applied to physical systems, especially in cases where there are hard constraints to satisfy (e.g., on safety or resources). In standard RL, the agent is incentivized to explore any behavior as long as it maximizes rewards, but in the real world, undesired behavior can damage either the system or the agent in a way that breaks the learning process itself. In this work, we model the problem of learning with constraints as a Constrained Markov Decision Process and provide a new on-policy formulation for solving it. A key contribution of our approach is to translate cumulative cost constraints into state-based constraints. Through this, we define a safe policy improvement algorithm, which maximizes returns while ensuring that the constraints are satisfied at every step. We provide theoretical guarantees under which the agent converges while ensuring safety over the course of training. We also highlight computational advantages of this approach. The effectiveness of our approach is demonstrated on safe navigation tasks and in safety-constrained versions of MuJoCo environments, with deep neural networks.

__Differentiating through the Fréchet mean __**Aaron Lou, **Isay Katsman, Qingxuan Jiang, Serge Belongie, **Ser Nam Lim**, Christopher De Sa

Recent advances in deep representation learning on Riemannian manifolds extend classical deep learning operations to better capture the geometry of the manifold. One possible extension is the Fréchet mean, the generalization of the Euclidean mean; however, it has been difficult to apply because it lacks a closed form with an easily computable derivative. In this paper, we show how to differentiate through the Fréchet mean for arbitrary Riemannian manifolds. Then, focusing on hyperbolic space, we derive explicit gradient expressions and a fast, accurate, and hyperparameter-free Fréchet mean solver. This fully integrates the Fréchet mean into the hyperbolic neural network pipeline. To demonstrate this integration, we present two case studies. First, we apply our Fréchet mean to the existing Hyperbolic Graph Convolutional Network, replacing its projected aggregation to obtain state-of-the-art results on datasets with high hyperbolicity. Second, to demonstrate the Fréchet mean’s capacity to generalize Euclidean neural network operations, we develop a hyperbolic batch normalization method that gives an improvement parallel to the one observed in the Euclidean setting.

__Efficient optimistic exploration in linear-quadratic regulators via Lagrangian relaxation __

Marc Abeille, **Alessandro Lazaric**

We study the exploration-exploitation dilemma in the linear quadratic regulator (LQR) setting. Inspired by the extended value iteration algorithm used in optimistic algorithms for finite MDPs, we propose to relax the optimistic optimization of OFU-LQ and cast it into a constrained extended LQR problem, where an additional control variable implicitly selects the system dynamics within a confidence interval. We then move to the corresponding Lagrangian formulation for which we prove strong duality. As a result, we show that an $\epsilon$-optimistic controller can be computed efficiently by solving at most $O\big(\log(1/\epsilon)\big)$ Riccati equations. Finally, we prove that relaxing the original OFU problem does not impact the learning performance, thus recovering the $\widetilde{O}(\sqrt{T})$ regret of OFU-LQ. To the best of our knowledge, this is the first computationally efficient confidence-based algorithm for LQR with worst-case optimal regret guarantees.

__Entropy minimization in emergent languages __**Evgeny Kharitonov**, **Rahma Chaabouni**, **Diane Bouchacourt**, **Marco Baroni**

There is growing interest in studying the languages emerging when neural agents are jointly trained to solve tasks requiring communication through a discrete channel. We investigate here the information-theoretic complexity of such languages, focusing on the basic two-agent, one-exchange setup. We find that, under common training procedures, the emergent languages are subject to an entropy minimization pressure that has also been detected in human language, whereby the mutual information between the communicating agent’s inputs and the messages is minimized, within the range afforded by the need for successful communication. This pressure is amplified as we increase communication channel discreteness. Further, we observe that stronger discrete-channel-driven entropy minimization leads to representations with increased robustness to overfitting and adversarial attacks. We conclude by discussing the implications of our findings for the study of natural and artificial communication systems.

__Fully parallel hyperparameter search: Reshaped space-filling __**Camille Couprie**, **Olivier Teytaud**, **Jérémy Rapin**, **Morgane Rivière**, **Nicolas Usunier**

Random search is the most classical fully parallel hyperparameter search method, outperforming grid search and almost equivalent to sophisticated methods, namely space-filling designs. We prove that many methods are actually equivalent up to a constant. Based on these results, considering the consistent but moderate improvement obtained by space-filling designs which preserve the same search distribution and just relax independence, we propose to reshape the search distribution. With an optimum normally distributed, we show that in high dimension the search distribution limited to a Dirac at 0 (all samples are equal, so cardinal 1) is actually better than using a sample of cardinal exponential in the dimension drawn from that same normal distribution. We deduce simple modifications of samplers, including a reshaping which won the Facebook CEC competition of one-shot optimization and repeatedly outperforms other methods when the prior probability distribution of the optimum is known. In the case of unknown prior probability distribution for the optimum, Cauchy counterparts turn out to perform best.
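The claim about the Dirac at 0 can be checked numerically on a small scale. The following is a minimal simulation sketch, not the authors' code: the function names, the dimension (50), the sample budget (50 points), and the squared-distance criterion are all assumptions chosen for illustration. It compares a single point at the origin against the best of several points drawn from the same standard normal prior as the optimum.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def compare_center_vs_random(dim=50, n_samples=50, n_trials=100, seed=0):
    """Average squared distance to a N(0, I) optimum for two one-shot strategies.

    All default values are illustrative assumptions, not settings from the paper.
    """
    rng = random.Random(seed)
    origin = [0.0] * dim
    center_total = 0.0
    random_total = 0.0
    for _ in range(n_trials):
        optimum = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Strategy 1: a single candidate at the origin (the "Dirac at 0").
        center_total += squared_distance(origin, optimum)
        # Strategy 2: best of n_samples candidates drawn from the same prior.
        candidates = (
            [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_samples)
        )
        random_total += min(squared_distance(c, optimum) for c in candidates)
    return center_total / n_trials, random_total / n_trials


center_avg, random_avg = compare_center_vs_random()
print("origin only:", center_avg)
print("best of random samples:", random_avg)
```

With a budget this small relative to the dimension, the single central point should land closer to the optimum on average, which is the intuition behind reshaping the search distribution toward the center.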

__Graph structure of neural networks__

Jiaxuan You, Jure Leskovec, **Kaiming He**, **Saining Xie**

Neural networks are often represented as graphs of connections between neurons. However, despite their wide use, there is currently little understanding of the relationship between the graph structure of the neural network and its predictive performance. Here, we systematically investigate how the graph structure of neural networks affects their predictive performance. To this end, we develop a novel graph-based representation of neural networks called relational graph, where layers of neural network computation correspond to rounds of message exchange along the graph structure. Using this representation, we show that: (1) graph structure of neural networks matters; (2) a “sweet spot” of relational graphs leads to neural networks with significantly improved predictive performance; (3) neural network’s performance is approximately a smooth function of the clustering coefficient and average path length of its relational graph; (4) our findings are consistent across many different tasks and datasets; (5) top architectures can be identified efficiently; (6) well-performing neural networks have graph structure surprisingly similar to those of real biological neural networks. Our work opens new directions for the design of neural architectures and the understanding of neural networks in general.

__Growing action spaces __**Gregory Farquhar**, **Laura Gustafson**, **Zeming Lin**, Shimon Whiteson, **Nicolas Usunier**, **Gabriel Synnaeve**

In complex tasks, such as those with large combinatorial action spaces, random exploration may be too inefficient to achieve meaningful learning progress. In this work, we use a curriculum of progressively growing action spaces to accelerate learning. We assume the environment is out of our control but that the agent may set an internal curriculum by initially restricting its action space. Our approach uses off-policy reinforcement learning to estimate optimal value functions for multiple action spaces simultaneously, and efficiently transfers data, value estimates, and state representations from restricted action spaces to the full task. We show the efficacy of our approach in proof-of-concept control tasks and on challenging large-scale StarCraft micromanagement tasks with large, multiagent action spaces.

__Interference and generalization in temporal difference learning __

Emmanuel Bengio, **Joelle Pineau**, Doina Precup

We study the link between generalization and interference in temporal-difference (TD) learning. Interference is defined as the inner product of two different gradients, representing their alignment; this quantity emerges as being of interest from a variety of observations about neural networks, parameter sharing, and the dynamics of learning. We find that TD easily leads to low-interference, under-generalizing parameters, while the effect seems reversed in supervised learning. We hypothesize that the cause can be traced back to the interplay between the dynamics of interference and bootstrapping. This is supported empirically by several observations: the negative relationship between the generalization gap and interference in TD, the negative effect of bootstrapping on interference and the local coherence of targets, and the contrast between the propagation rate of information in TD(0) versus TD(λ) and regression tasks such as Monte-Carlo policy evaluation. We hope that these new findings can guide the future discovery of better bootstrapping methods.
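The interference quantity defined above, the inner product of two gradients, can be illustrated in a few lines. This is a minimal sketch under simplifying assumptions: it uses a linear model with a supervised squared loss rather than a TD objective, and the helper names `grad_squared_loss` and `interference` are hypothetical.

```python
def grad_squared_loss(w, x, y):
    """Gradient of 0.5 * (w . x - y)^2 with respect to w for a linear model."""
    error = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [error * xi for xi in x]


def interference(grad_a, grad_b):
    """Inner product of two gradients.

    Positive: the gradients are aligned, so a step on one sample also lowers
    the loss on the other (to first order). Zero: the updates are independent.
    Negative: the updates conflict.
    """
    return sum(ga * gb for ga, gb in zip(grad_a, grad_b))


w = [0.0, 0.0]
g1 = grad_squared_loss(w, [1.0, 0.0], 1.0)
g2 = grad_squared_loss(w, [1.0, 1.0], 1.0)  # shares a feature with sample 1
g3 = grad_squared_loss(w, [0.0, 1.0], 1.0)  # shares no feature with sample 1

print(interference(g1, g2))  # aligned gradients
print(interference(g1, g3))  # orthogonal gradients
```

In a neural network the same inner product would be taken over all parameters, which is where parameter sharing determines how much an update on one input affects another.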

__Invariant causal prediction for block MDPs __**Amy Zhang**, Clare Lyle, **Shagun Sodhani**, Angelos Filos, Marta Kwiatkowska, **Joelle Pineau**, Yarin Gal, Doina Precup

Generalization across environments is critical for the successful application of reinforcement learning algorithms to real-world challenges. In this paper, we consider the problem of learning abstractions that generalize in block MDPs, families of environments with a shared latent state space, and dynamics structure over that latent space, but varying observations. We leverage tools from causal inference to propose a method of invariant prediction to learn state abstractions that generalize to novel observations in the multi-environment setting. We prove that for certain classes of environments, this approach outputs with high probability a state abstraction corresponding to the causal feature set with respect to the return. We further provide more general bounds on model error and generalization error in the multi-environment setting, in the process showing a connection between causal variable selection and the state abstraction framework for MDPs. We give empirical evidence that our methods work in both linear and nonlinear settings, attaining improved generalization over single- and multi-task baselines.

__Learning near optimal policies with low inherent Bellman error__

Andrea Zanette, **Alessandro Lazaric**, Mykel Kochenderfer, Emma Brunskill

We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning under the notion of low inherent Bellman error, a condition normally employed to show convergence of approximate value iteration. First, we relate this condition to other common frameworks and show that it is strictly more general than the low rank (or linear) MDP assumption of prior work. Second, we provide an algorithm with a high probability regret bound $\widetilde{O}\big(\sum_{t=1}^H d_t \sqrt{K} + \sum_{t=1}^H \sqrt{d_t}\, \mathcal{I} K\big)$ where $H$ is the horizon, $K$ is the number of episodes, $\mathcal{I}$ is the value of the inherent Bellman error, and $d_t$ is the feature dimension at timestep $t$. In addition, we show that the result is unimprovable beyond constants and logs by showing a matching lower bound. This has two important consequences: 1) The algorithm has the optimal statistical rate for this setting, which is more general than prior work on low-rank MDPs; and 2) the lack of closedness (measured by the inherent Bellman error) is only amplified by $\sqrt{d_t}$ despite working in the online setting. Finally, the algorithm reduces to the celebrated LinUCB when $H = 1$ but with a different choice of the exploration parameter that allows handling misspecified contextual linear bandits. While computational tractability questions remain open for the MDP setting, this enriches the class of MDPs with a linear representation for the action-value function where statistically efficient reinforcement learning is possible.

__Learning robot skills with temporal variational inference __**Tanmay Shankar**, **Abhinav Gupta**

In this paper, we address the discovery of robotic options from demonstrations in an unsupervised manner. Specifically, we present a framework to jointly learn low-level control policies and higher-level policies of how to use them from demonstrations of a robot performing various tasks. By representing options as continuous latent variables, we frame the problem of learning these options as latent variable inference. We then present a temporal formulation of variational inference based on a temporal factorization of trajectory likelihoods, which allows us to infer options in an unsupervised manner. We demonstrate the ability of our framework to learn such options across three robotic demonstration datasets.

__Meta-learning with shared amortized variational inference __

Ekaterina Iakovleva, **Jakob Verbeek**, Karteek Alahari

We propose a novel scheme for amortized variational inference for an empirical Bayes meta-learning model, where model parameters are treated as latent variables. We learn the prior distribution over model parameters conditioned on limited training data using a variational autoencoder approach, and share the same amortized inference network between the conditional prior and variational posterior distribution over the model parameters. While the posterior leverages both the labeled support and query data, the conditional prior is based on the labeled support data only. We show that in earlier approaches based on Monte-Carlo approximation, the conditional prior collapses to a Dirac delta function. In contrast, our variational approach prevents this collapse and preserves uncertainty over the model parameters. We evaluate our approach on the miniImageNet and FC100 datasets, and we present results demonstrating the advantage of our approach over previous work.

__Meta-learning in stochastic linear bandits__

Leonardo Cella, **Alessandro Lazaric**, Massimiliano Pontil

We investigate meta-learning procedures in the setting of stochastic linear bandit tasks. The goal is to select a learning algorithm which works well on average over a class of bandit tasks that are sampled from a task distribution. Inspired by recent work on learning-to-learn linear regression, we consider a class of bandit algorithms that implement a regularized version of the well-known OFUL algorithm, where the regularization is a square Euclidean distance to a bias vector. We first study the benefit of the biased OFUL algorithm in terms of regret minimization. We then propose two strategies to estimate the bias within the learning-to-learn setting. We show, both theoretically and experimentally, that when the number of tasks grows and the variance of the task distribution is small, our strategies have a significant advantage over learning the tasks in isolation.
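To make the biased regularization concrete, here is a minimal sketch of a least-squares estimate regularized by the squared Euclidean distance to a bias vector. It covers only the estimation step, not the confidence sets or optimistic action selection of OFUL, and the function names and the regularization weight `lam` are assumptions for illustration. The minimizer of $\sum_s (x_s^\top \theta - y_s)^2 + \lambda \lVert \theta - h \rVert^2$ solves $(X^\top X + \lambda I)\,\theta = X^\top y + \lambda h$.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def biased_ridge(features, rewards, bias, lam):
    """Minimize sum_s (x_s . theta - y_s)^2 + lam * ||theta - bias||^2."""
    d = len(bias)
    # Start from lam * I and lam * bias, then accumulate X^T X and X^T y.
    gram = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    rhs = [lam * b for b in bias]
    for x, y in zip(features, rewards):
        for i in range(d):
            rhs[i] += x[i] * y
            for j in range(d):
                gram[i][j] += x[i] * x[j]
    return solve_linear_system(gram, rhs)


# With no observations the estimate equals the bias vector.
print(biased_ridge([], [], [1.0, -1.0], lam=2.0))
# With data, the estimate moves from the bias toward the least-squares fit.
print(biased_ridge([[1.0, 0.0]], [3.0], [0.0, 0.0], lam=1.0))
```

Setting the bias to the zero vector recovers ordinary ridge regression, so a bias close to the true task parameters mainly helps in the early rounds, when little data is available.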

__Near-linear time Gaussian process optimization with adaptive batching and resparsification __Daniele Calandriello, Luigi Carratino, **Alessandro Lazaric**, Michal Valko, Lorenzo Rosasco

Gaussian processes (GP) are one of the most successful frameworks to model uncertainty. However, GP optimization (e.g., GP-UCB) suffers from major scalability issues. Experimental time grows linearly with the number of evaluations, unless candidates are selected in batches (e.g., using GP-BUCB) and evaluated in parallel. Furthermore, computational cost is often prohibitive since algorithms such as GP-BUCB require a time at least quadratic in the number of dimensions and iterations to select each batch. In this paper, we introduce BBKB (Batch Budgeted Kernel Bandits), the first no-regret GP optimization algorithm that provably runs in near-linear time and selects candidates in batches. This is obtained with a new guarantee for the tracking of the posterior variances that allows BBKB to choose increasingly larger batches, improving over GP-BUCB. Moreover, we show that the same bound can be used to adaptively delay costly updates to the sparse GP approximation used by BBKB, achieving a near-constant per-step amortized cost. These findings are then confirmed in several experiments, where BBKB is much faster than state-of-the-art methods.

__No-regret exploration in goal-oriented reinforcement learning __**Jean Tarbouriech**, **Evrard Garcelon**, Michal Valko, **Matteo Pirotta**, **Alessandro Lazaric**

Many popular reinforcement learning problems (e.g., navigation in a maze, some Atari games, mountain car) are instances of the episodic setting under its stochastic shortest path (SSP) formulation, where an agent has to achieve a goal state while minimizing the cumulative cost. Despite the popularity of this setting, the exploration-exploitation dilemma has been sparsely studied in general SSP problems, with most of the theoretical literature focusing on different problems (e.g., fixed horizon and infinite horizon) or making the restrictive loop-free SSP assumption (i.e., no state can be visited twice during an episode). In this paper, we study the general SSP problem with no assumption on its dynamics (some policies may actually never reach the goal). We introduce UC-SSP, the first no-regret algorithm in this setting, and prove a regret bound scaling as $\widetilde{O}(DS\sqrt{ADK})$ after $K$ episodes for any unknown SSP with $S$ states, $A$ actions, positive costs, and SSP-diameter $D$, defined as the smallest expected hitting time from any starting state to the goal. We achieve this result by crafting a novel stopping rule, such that UC-SSP may interrupt the current policy if it is taking too long to achieve the goal and switch to alternative policies that are designed to rapidly terminate the episode.

__Online learned continual compression with adaptive quantization modules __**Lucas Caccia**, Eugene Belilovsky, Massimo Caccia, **Joelle Pineau**

We introduce and study the problem of Online Continual Compression, where one attempts to simultaneously learn to compress and store a representative dataset from a non-i.i.d. data stream, while only observing each sample once. A naive application of auto-encoders in this setting encounters a major challenge: Representations derived from earlier encoder states must be usable by later decoder states. We show how to use discrete autoencoders to effectively address this challenge and introduce adaptive quantization modules (AQM) to control variation in the compression ability of the module at any given stage of learning. This enables selecting an appropriate compression for incoming samples, while taking into account overall memory constraints and current progress of the learned compression. Unlike previous methods, our approach does not require any pretraining, even on challenging datasets. We show that using AQM to replace standard episodic memory in continual learning settings leads to significant gains on continual learning benchmarks with images, LiDAR, and reinforcement learning agents.

__On the convergence of Nesterov’s accelerated gradient method in stochastic settings __**Mido Assran, Michael Rabbat**

We study Nesterov’s accelerated gradient method with constant step-size and momentum parameters in the stochastic approximation setting (unbiased gradients with bounded variance) and the finite-sum setting (where randomness is due to sampling mini-batches). To build better insight into the behavior of Nesterov’s method in stochastic settings, we focus throughout on objectives that are smooth, strongly-convex, and twice continuously differentiable. In the stochastic approximation setting, Nesterov’s method converges to a neighborhood of the optimal point at the same accelerated rate as in the deterministic setting. Perhaps surprisingly, in the finite-sum setting, we prove that Nesterov’s method may diverge with the usual choice of step-size and momentum, unless additional conditions on the problem related to conditioning and data coherence are satisfied. Our results shed light as to why Nesterov’s method may fail to converge or achieve acceleration in the finite-sum setting.
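For reference, the update studied here can be sketched as follows. This is a minimal illustration with exact (deterministic) gradients on a small strongly convex quadratic, so it does not reproduce the stochastic or finite-sum behavior analyzed in the paper. The test objective, the step-size $1/L$, and the momentum $(\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$ are a common textbook choice assumed for illustration.

```python
import math


def nesterov(grad, x0, step_size, momentum, n_steps):
    """Nesterov's accelerated gradient with constant step-size and momentum."""
    x_prev = list(x0)
    y = list(x0)
    for _ in range(n_steps):
        g = grad(y)
        # Gradient step taken from the extrapolated (look-ahead) point y.
        x = [yi - step_size * gi for yi, gi in zip(y, g)]
        # Extrapolate along the direction of the latest move.
        y = [xi + momentum * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        x_prev = x
    return x_prev


# Illustrative objective f(x) = 0.5 * (x1^2 + 10 * x2^2), minimized at the origin.
mu, smoothness = 1.0, 10.0


def quadratic_grad(x):
    return [mu * x[0], smoothness * x[1]]


kappa = smoothness / mu
beta = (math.sqrt(kappa) - 1.0) / (math.sqrt(kappa) + 1.0)
solution = nesterov(quadratic_grad, [5.0, 5.0], 1.0 / smoothness, beta, 100)
print(solution)
```

Replacing `quadratic_grad` with a mini-batch gradient estimate gives the finite-sum variant, which is the setting where the abstract notes that these same constants may no longer guarantee convergence.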

__“Other-play” for zero-shot coordination __**Hengyuan Hu**, **Adam Lerer**, **Alex Peysakhovich**, **Jakob Foerster**

We consider the problem of zero-shot coordination — constructing AI agents that can coordinate with novel partners they have not seen before (e.g., humans). Standard multiagent reinforcement learning (MARL) methods typically focus on the self-play (SP) setting, where agents construct strategies by playing the game with themselves repeatedly. Unfortunately, applying SP naively to the zero-shot coordination problem can produce agents that establish highly specialized conventions that do not carry over to novel partners they have not been trained with. We introduce a novel learning algorithm called other-play (OP), which enhances self-play by looking for more robust strategies. We characterize OP theoretically as well as experimentally. We study the cooperative card game Hanabi and show that OP agents achieve higher scores than SP agents when paired with independently trained agents as well as with human players.

__Stochastic Hamiltonian gradient methods for smooth games __

Nicolas Loizou, **Hugo Berard**, Alexia Jolicoeur-Martineau, **Pascal Vincent**, Simon Lacoste-Julien, Ioannis Mitliagkas

The analysis of smooth games has attracted attention, motivated by the success of adversarial formulations. The Hamiltonian method is a lightweight second-order approach that recasts the problem in terms of a minimization objective. Consensus optimization can be seen as a generalization: It mixes a Hamiltonian term with the original game dynamics. This family of Hamiltonian methods has shown promise in the literature. However, they come with no guarantees for stochastic games. Classic stochastic extragradient and mirror-prox methods require averaging over a compact domain to achieve convergence. Recent variance-reduced first-order schemes focus on unbounded domains, but stop short of proving last-iterate convergence for bilinear matrix games. We analyze the stochastic Hamiltonian method and a novel variance-reduced variant of it and provide the first set of last-iterate convergence guarantees for stochastic unbounded bilinear games. More generally, we provide convergence guarantees for a family of stochastic games, notably including some nonconvex ones. We supplement our analysis with experiments on a stochastic bilinear game, where our theory is shown to be tight, and simple adversarial machine learning formulations.
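The idea of recasting a game as a minimization problem can be seen on the simplest bilinear game, $\min_x \max_y \, xy$. The sketch below is deterministic and scalar, so it illustrates only the basic Hamiltonian method and not the stochastic or variance-reduced variants from the paper. The function names, learning rate, and starting point are assumptions. The game's vector field is $v = (\partial f/\partial x, -\partial f/\partial y) = (y, -x)$, and the Hamiltonian is $H = \tfrac{1}{2}\lVert v \rVert^2 = \tfrac{1}{2}(x^2 + y^2)$.

```python
import math


def simultaneous_gda_step(x, y, lr):
    """One step of simultaneous gradient descent-ascent on f(x, y) = x * y."""
    return x - lr * y, y + lr * x


def hamiltonian_gd_step(x, y, lr):
    """One gradient descent step on H(x, y) = 0.5 * (x^2 + y^2).

    For f(x, y) = x * y the game vector field is (y, -x), so the gradient
    of H is simply (x, y).
    """
    return x - lr * x, y - lr * y


def distance_to_equilibrium(step, n_steps=100, lr=0.1, start=(1.0, 1.0)):
    """Run a method and return the final distance to the equilibrium (0, 0)."""
    x, y = start
    for _ in range(n_steps):
        x, y = step(x, y, lr)
    return math.hypot(x, y)


print("simultaneous GDA:", distance_to_equilibrium(simultaneous_gda_step))
print("Hamiltonian GD:", distance_to_equilibrium(hamiltonian_gd_step))
```

Simultaneous descent-ascent spirals away from the equilibrium on this game, while descent on the Hamiltonian contracts toward it at every step.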

__Student specialization in deep rectified networks with finite width and input dimension __**Yuandong Tian**

We consider a deep ReLU/Leaky ReLU student network trained from the output of a fixed teacher network of the same depth, with stochastic gradient descent (SGD). The student network is over-realized: At each layer $l$, the number $n_l$ of student nodes is more than that ($m_l$) of the teacher. Under mild conditions on dataset and teacher network, we prove that when the gradient is small at every data sample, each teacher node is specialized by at least one student node at the lowest layer. For a two-layer network, such specialization can be achieved by training on any dataset of polynomial size $\mathcal{O}(K^{5/2} d^3 \epsilon^{-1})$ until the gradient magnitude drops to $\mathcal{O}(\epsilon / K^{3/2} \sqrt{d})$. Here, $d$ is the input dimension, and $K = m_1 + n_1$ is the total number of neurons in the lowest layer of teacher and student. Note that we require a specific form of data augmentation, and the sample complexity includes the additional data generated from augmentation. To our best knowledge, we are the first to give polynomial sample complexity for student specialization of training two-layer (Leaky) ReLU networks with finite depth and width in teacher-student setting, and finite complexity for the lowest layer specialization in multilayer case, without parametric assumption of the input (like Gaussian). Our theory suggests that teacher nodes with large fan-out weights get specialized first, when the gradient is still large, while others are specialized with small gradient, which suggests inductive bias in training. This shapes the stage of training as empirically observed in multiple previous works. Experiments on synthetic and CIFAR10 verify our findings. The code is released on GitHub: https://github.com/facebookresearch/luckmatters

__The differentiable cross-entropy method __**Brandon Amos**, **Denis Yarats**

We study the cross-entropy method (CEM) for the nonconvex optimization of a continuous and parameterized objective function and introduce a differentiable variant that enables us to differentiate the output of CEM with respect to the objective function’s parameters. In the machine learning setting, this brings CEM inside of the end-to-end learning pipeline, where this has otherwise been impossible. We show applications in a synthetic energy-based structured prediction task and in nonconvex continuous control. In the control setting, we show how to embed optimal action sequences into a lower-dimensional space. This enables us to use policy optimization to fine-tune modeling components by differentiating through the CEM-based controller.
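As background, the standard cross-entropy method that the paper builds on can be sketched in a few lines. This is the ordinary, non-differentiable CEM applied to a simple one-dimensional objective, not the differentiable variant the paper introduces. The function name, the Gaussian sampling distribution, and all default hyperparameters are assumptions for illustration.

```python
import random
import statistics


def cross_entropy_method(objective, mean=0.0, std=5.0, n_samples=50,
                         n_elite=10, n_iters=30, seed=0):
    """Minimize a 1-D objective by repeatedly refitting a Gaussian to elites."""
    rng = random.Random(seed)
    for _ in range(n_iters):
        # Sample candidates from the current search distribution.
        samples = [rng.gauss(mean, std) for _ in range(n_samples)]
        # Keep the candidates with the lowest objective values.
        elites = sorted(samples, key=objective)[:n_elite]
        # Refit the distribution to the elites; the small constant keeps the
        # standard deviation strictly positive.
        mean = statistics.mean(elites)
        std = statistics.pstdev(elites) + 1e-8
    return mean


def shifted_quadratic(x):
    """Illustrative objective with its minimum at x = 3."""
    return (x - 3.0) ** 2


best = cross_entropy_method(shifted_quadratic)
print(best)
```

The hard selection of the top candidates in the `sorted(...)[:n_elite]` step is what blocks gradients from flowing through the procedure, which is the obstacle a differentiable variant has to work around.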

__Parallel machine translation with disentangled context transformer__

Jungo Kasai, **James Cross**, **Marjan Ghazvininejad**, and **Jiatao Gu**

State-of-the-art neural machine translation models generate a translation from left to right, and every step is conditioned on the previously generated tokens. The sequential nature of this generation process causes fundamental latency in inference since we cannot generate multiple tokens in each sentence in parallel. We propose an attention-masking based model, called disentangled context (DisCo) transformer, that simultaneously generates all tokens given different contexts. The DisCo transformer is trained to predict every output token given an arbitrary subset of the other reference tokens. We also develop the parallel easy-first inference algorithm, which iteratively refines every token in parallel and reduces the number of required iterations. Our extensive experiments on seven directions with varying data sizes demonstrate that our model achieves competitive, if not better, performance compared with the state of the art in non-autoregressive machine translation while significantly reducing decoding time on average.

__Voice separation with an unknown number of multiple speakers __**Eliya Nachmani**, **Yossef Mordechay Adi**, **Lior Wolf**

We present a new method for separating a mixed audio sequence, in which multiple voices speak simultaneously. The new method employs gated neural networks that are trained to separate the voices at multiple processing steps, while maintaining the speaker in each output channel fixed. A different model is trained for every number of possible speakers, and the model with the largest number of speakers is employed to select the actual number of speakers in a given sample. Our method greatly outperforms the current state of the art, which, as we show, is not competitive for more than two speakers.

__Word-level speech recognition with a letter to word encoder __**Ronan Collobert**, **Awni Hannun**, **Gabriel Synnaeve**

We propose a direct-to-word sequence model which uses a word network to learn word embeddings from letters. The word network can be integrated seamlessly with arbitrary sequence models including connectionist temporal classification and encoder-decoder models with attention. We show our direct-to-word model can achieve word error rate gains over sub-word level models for speech recognition. We also show that our direct-to-word approach retains the ability to predict words not seen at training time without any retraining. Finally, we demonstrate that a word-level model can use a larger stride than a sub-word level model while maintaining accuracy. This makes the model more efficient both for training and inference.

Latinx in AI Workshop, July 13, 7:30 am EDT

Francisco Guzman is a speaker.

Women in Machine Learning Un-Workshop, July 13, 6:00 GMT

Facebook AI is sponsoring this workshop. Kalesha Bullard is speaking.

PyTorch Live Q&A sessions, July 15

There is a full day of PyTorch live Q&A sessions. You can check out the schedule by visiting the Facebook AI virtual booth.

Workshop on Continual Learning, July 17

David Lopez-Paz is part of the organizing committee.

Self-supervision in Audio and Speech, July 17, 7:05 AM GMT

Lorenzo Torresani is a speaker.

Workshop on eXtreme Classification: Theory and Applications, July 17, 9am EST

Tomas Mikolov is a speaker.

1st Workshop on Language in Reinforcement Learning (LaReL), July 18, 10 am EST

Jakob Foerster, Edward Grefenstette, and Tim Rocktäschel are part of the organizing committee.

Arthur Szlam is a speaker.

Lifelong Learning Workshop, July 18, 5am EST

Shagun Sodhani and Koustuv Sinha are part of the organizing committee.

Workshop on Learning in Artificial Open Worlds, July 18, 10am EST

Kavya Srinet and Arthur Szlam are part of the organizing committee.

Tim Rocktäschel is a speaker.

MLRetrospectives: A Venue for Self-Reflection in ML Research, July 18, 2020, 9am EST

Joelle Pineau is an organizer.