Meta AI research at NeurIPS 2021: Embodied agents, unsupervised speech recognition, and more

December 6, 2021

We’re excited to share that Meta AI researchers will be presenting 83 papers at NeurIPS 2021, including eight spotlights and five orals. One paper received an Outstanding Paper Award. Our researchers have also helped co-organize six NeurIPS workshops and five challenges, and they will be giving several invited and contributed talks at workshops.

Collaborating with the AI community through challenges and workshops

Meta AI is a proud sponsor of two NeurIPS affinity group workshops, LatinX and WiML, and the Black in AI organization.

We’ve also helped organize five challenges at NeurIPS this year, and Meta AI researchers contributed to organizing six workshops.

Advancing the state of the art with new research

Below, we showcase some highlights of our research at NeurIPS, roughly grouped into four themes: embodied agents and efficient exploration, speech and NLP, understanding the world from visual data, and generative models and foundations of representation learning. The full list of accepted papers from Meta AI is available here.

Embodied agents and efficient exploration

Reinforcement learning (RL) agents learn by interacting with their environment, which is often a simulation of a real-world space so that trials can be carried out much more quickly, safely, and efficiently. Our contributions under this theme include Habitat 2.0, a new 3D photorealistic simulation platform in which agents can both navigate through the environment and interact with objects. Our work also addresses fundamental questions about efficient exploration in a variety of settings: when there is no particular task and the agent relies on novel intrinsic rewards, when the goal is to reach a specific state, and when each step or simulation may be expensive.

Habitat 2.0 (paper, blogpost)

Interesting object, curious agent (Oral session 5: RL & planning, on Friday, Dec. 10)

Stochastic shortest path: Minimax, parameter-free and towards horizon-free regret

A provably efficient sample collection strategy for reinforcement learning

Speech and NLP

Supervised learning (learning from labeled data) provides state-of-the-art performance in most domains in AI. Yet humans learn accurate predictive models of the world largely from observations, without labeled examples. Moreover, obtaining human annotations of data is time-consuming, error-prone, and resource-intensive. The need for large amounts of labeled data has especially limited the ability to train speech recognition models for so-called low-resource languages: ones for which audio with corresponding aligned transcripts is not abundantly available.

Our work in this theme presents new methods for training speech recognition models without transcribed data and for learning common embeddings for speech and text regardless of language, as well as new approaches to scaling large sparse mixture-of-expert models that automate the routing of inputs to the most appropriate subsystem.

Our unsupervised approach to speech recognition requires no transcribed data, making it possible to train models in many languages for which little or no transcribed data is currently available. We also build on our previous work with LASER to present a new approach to embedding speech and text in a common representation space, such that related sentences are close to each other regardless of whether the input is speech or text, and regardless of what language the speech or text comes from. The new embeddings open up many possibilities, including large-scale speech-to-text and even speech-to-speech mining without first transcribing and then translating the transcription.

Unsupervised speech recognition (Oral session 3: Deep Learning, on Wednesday, Dec. 8)

Multimodal and multilingual embeddings for large-scale speech mining
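To make the idea of mining in a shared embedding space more concrete, here is a minimal sketch. It assumes speech utterances and text sentences have already been encoded into vectors of the same dimension; the toy vectors, the identifiers, the `mine_pairs` helper, and the plain cosine-similarity threshold are all hypothetical choices for illustration and are not taken from the paper, which may use a different scoring criterion.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mine_pairs(speech_embs, text_embs, threshold=0.9):
    """For each speech item, keep its nearest text item if it is similar enough.

    A plain similarity threshold is a simplification chosen for this sketch.
    """
    pairs = []
    for s_id, s_vec in speech_embs.items():
        best_id, best_score = max(
            ((t_id, cosine(s_vec, t_vec)) for t_id, t_vec in text_embs.items()),
            key=lambda item: item[1],
        )
        if best_score >= threshold:
            pairs.append((s_id, best_id))
    return pairs


# Made-up 3-dimensional embeddings standing in for real encoder outputs.
speech_embs = {
    "utt_1": [0.9, 0.1, 0.0],
    "utt_2": [0.0, 0.2, 0.98],
    "utt_3": [0.5, 0.5, 0.5],  # not close to any sentence below
}
text_embs = {
    "sent_a": [1.0, 0.0, 0.0],
    "sent_b": [0.0, 0.0, 1.0],
}

pairs = mine_pairs(speech_embs, text_embs)
print(pairs)
```

Because both modalities live in the same space, the speech side is matched against text directly, with no intermediate transcription step.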

Scale has also been a major factor in advancing the state of the art in natural language processing. Our work on Hash Layers demonstrates that it is possible to build large-scale, high-performance mixture-of-expert networks by using deterministic hashing of the input tokens to route inputs to experts. Our method compares favorably to the current state-of-the-art approach, in which models learn to route inputs through mixtures of experts.

Hash layers for large sparse models
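As a rough illustration of the routing idea described above, the sketch below assigns each token to an expert using a fixed hash of its token ID instead of a learned router. The expert count, the function names, and the use of SHA-256 are assumptions made for this example; they are not taken from the Hash Layers paper, which studies its own hashing schemes.

```python
import hashlib
from collections import Counter

NUM_EXPERTS = 4  # hypothetical number of experts in one sparse layer


def route_token(token_id, num_experts=NUM_EXPERTS):
    """Map a token ID to an expert index with a fixed, non-learned hash."""
    digest = hashlib.sha256(str(token_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_experts


def route_sequence(token_ids, num_experts=NUM_EXPERTS):
    """Route every token in a sequence; no parameters are trained here."""
    return [route_token(t, num_experts) for t in token_ids]


# Made-up token IDs; repeated IDs always land on the same expert.
token_ids = [17, 4052, 17, 993, 4052]
assignments = route_sequence(token_ids)

# Rough check of how evenly a hypothetical 10,000-token vocabulary spreads.
load = Counter(route_token(t) for t in range(10_000))
print(assignments, dict(load))
```

The routing depends only on the token ID, so it is deterministic and adds no router parameters. How well the load balances depends on the hash function and on how often each token actually occurs.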

Understanding the world from visual data

Our VolSDF model can take a set of input images (left) and learn a volumetric density (center left, sliced) defined by a signed distance function (center right, sliced) to produce a neural rendering (right). This definition of density facilitates high-quality geometry reconstruction (gray surfaces, middle). Original image is from the BlendedMVS data set, under the Creative Commons Attribution 4.0 license.

Understanding the world from visual data (such as images or video) remains a key challenge for the research community. Visual Transformer architectures provide a powerful new inductive bias for applications involving visual data, generalizing convolutional neural networks. We propose new transformer models for image segmentation and for tracking and action recognition in videos. The MaskFormer model achieves state-of-the-art accuracy simultaneously for semantic and panoptic segmentation. Trajectory attention achieves state-of-the-art accuracy for action recognition across multiple benchmarks. While modalities such as images and video do not explicitly capture 3D information, visual techniques increasingly benefit from exploiting the 3D structure of the world. For example, we have introduced a new self-supervised approach called SEAL for jointly improving object detection and instance segmentation models by moving around physical environments. We also introduce a new technique called VolSDF, which leverages novel neural rendering techniques to build a 3D model from a collection of images.

Per-pixel classification is not all you need for semantic segmentation

Keeping your eye on the ball (Oral session 3: Vision applications, on Wednesday, Dec. 8)

SEAL: Self-supervised embodied active learning

Volume rendering of neural implicit surfaces (Oral session 3: Vision applications, on Wednesday, Dec. 8)
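The VolSDF description above mentions a volumetric density defined by a signed distance function. One way to picture this is a smooth, monotone mapping from signed distance to density, so that density is high inside the surface and falls off quickly outside it. The sketch below uses a scaled Laplace CDF for that mapping; the specific form, the parameters `alpha` and `beta`, their default values, and the sign convention (negative distances inside the object) are illustrative assumptions here, not details stated in this post.

```python
import math


def laplace_cdf(s, beta):
    """CDF of a zero-mean Laplace distribution with scale beta."""
    if s <= 0:
        return 0.5 * math.exp(s / beta)
    return 1.0 - 0.5 * math.exp(-s / beta)


def density_from_sdf(signed_distance, beta=0.1, alpha=None):
    """Turn a signed distance into a volume density.

    Assumed convention: signed_distance < 0 inside the object, > 0 outside.
    alpha defaults to 1 / beta, an assumption made for this sketch.
    """
    if alpha is None:
        alpha = 1.0 / beta
    return alpha * laplace_cdf(-signed_distance, beta)


# Sample points inside the object, on the surface, and outside it.
samples = (-0.5, 0.0, 0.5)
densities = [density_from_sdf(d) for d in samples]
print(densities)
```

A smaller `beta` makes the transition at the surface sharper, which ties the rendered volume more tightly to the underlying geometry.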

Generative models and foundations of representation learning

When conditioned on the image shown on the left along with the class label, the IC-GAN generated the images shown on the right.

The ability to create new, never-seen-before content is an important step along the path to human-level intelligence. In our work on instance-conditioned GANs, we introduce a new family of controllable generative models that produce new images conditioned on the representation of another image and, optionally, a class label. This additional level of control allows us to generate images that lie well outside the distribution of images on which the model was trained. We also introduce a novel, computationally efficient type of continuous normalizing flow, called Moser Flow, which makes it possible to learn distributions with complex geometric structure.

IC-GAN (paper, blogpost)

Moser Flow: Divergence-based generative modeling on manifolds (Oral session 5: Generative Modeling, on Friday, Dec. 10). This work received an Outstanding Paper Award.

Most lossy image compression methods, such as JPEG, aim to reduce the number of bits required to store an image without impacting its visual quality as perceived by humans. We introduce a new learned compression technique that can significantly reduce the number of bits required to store an image (e.g., using 1,000 times fewer bits) without affecting the ability of downstream models to classify the content of the images.

Lossy compression for lossless prediction

Our researchers will be available at the respective posters to share more about their work and answer questions. We also invite you to drop by Meta AI on Gather.Town to speak with researchers and recruiters.