To accomplish a task like checking to see whether you locked the front door or retrieving a cell phone that’s ringing in an upstairs bedroom, AI assistants of the future must learn to plan their route, navigate effectively, look around their physical environment, listen to what's happening around them, and build memories of the 3D space. These smarter assistants will require new advances in embodied AI, which seeks to teach machines to understand and interact with the complexities of the physical world as people do.
Today, we’re announcing several new milestones that introduce important capabilities to push the limits of embodied agents even further. This foundational research introduces state-of-the-art embodied agents that learn how to explore and understand more complex, realistic spaces from egocentric views or multimodal signals:
The first audio-visual platform for embodied AI. With this new platform, researchers can train AI agents in 3D environments with highly realistic acoustics. This opens up an array of new embodied AI tasks, such as navigating to a sound-emitting target, learning from echolocation, or exploring with multimodal sensors. Adding sound not only yields faster training and more accurate navigation at inference, but also enables the agent to discover the goal on its own from afar. To facilitate future work in this new direction, we collaborated with Facebook Reality Labs to release SoundSpaces, a collection of audio renderings for the publicly available Replica and Matterport3D environments.
We built an end-to-end learnable framework for building top-down semantic maps (showing where objects are located) and spatio-semantic memories (“mental maps”) from egocentric observations. This new research enables agents to learn and reason about how to navigate to objects seen during a tour (e.g., find the table) or answer questions about the space (e.g., how many chairs are in the house?).
And, we achieved state-of-the-art results in both navigation and exploration of unfamiliar spaces even when some areas are hidden or out of view (e.g., behind a table). Our occupancy anticipation approach won first place in the Habitat 2020 point-goal navigation challenge at the Conference on Computer Vision and Pattern Recognition (CVPR) 2020, significantly surpassing scores from the other entrants.
These advances leverage our previous work in this subfield, including our state-of-the-art, open source AI Habitat simulation platform (with built-in support for Facebook Reality Labs’ Replica data set of photorealistic virtual environments, as well as Matterport, Gibson, and other data sets); DD-PPO, a distributed reinforcement learning (RL) algorithm enabling massive-scale training of AI agents that can perform near-perfect point-goal navigation; and Ego-Topo, a video encoder that transforms egocentric video into a human-centric map capturing how people use a physical space.
AI Habitat is at the core of the new embodied systems we’re announcing today, helping propel agents to learn more humanlike skills, from multimodal sensory understanding to complex reasoning about objects and places. The simulation platform can train virtual robots in photorealistic 3D environments, capable of running at more than 10,000 frames per second on a single GPU — more than 100x faster than real time. Early experiments in transferring skills from Habitat to a physical robot have been promising, and we believe these breakthroughs will speed progress toward building machines that can understand and operate intelligently in the real world.
Both sights and sounds constantly drive our activity: A crying child draws our attention; the sound of breaking glass may require urgent help; the kids are talking to Grandma on the phone in the family room, so we decide to lower the volume on the TV.
Today’s embodied agents are deaf, lacking this multimodal semantic understanding of the 3D world around them. We’ve built and are now open-sourcing SoundSpaces to address this need: it’s a first-of-its-kind data set of audio renderings based on geometrical acoustic simulations for complex 3D environments. Built on top of AI Habitat, SoundSpaces provides a new audio sensor, making it possible to insert high-fidelity, realistic simulations of any sound source in an array of real-world scanned environments from the Replica and Matterport3D data sets.
Leveraging SoundSpaces, we introduce a new task for embodied AI: AudioGoal, where the agent must move through an unmapped environment to find a sound-emitting object, such as a phone ringing. To our knowledge, this is the first attempt to train deep reinforcement learning agents that both see and hear to map novel environments and localize sound-emitting targets. With this approach, we achieved faster training and higher accuracy in navigation than with single modality counterparts.
Unlike traditional navigation systems that tackle point-goal navigation, our agent doesn't require a pointer to the goal location. This means an agent can now act upon “go find the ringing phone” rather than “go to the phone that is 25 feet southwest of your current position.” It can discover the goal position on its own using multimodal sensing (see figure below).
Finally, our learned audio encoding provides spatial cues similar to, or even better than, GPS displacements. This suggests that audio could offer robustness to GPS noise, which is common in indoor environments.
To build SoundSpaces, we used a state-of-the-art algorithm for room acoustics modeling and a bidirectional path tracing algorithm to model sound reflections in the room geometry. Since materials influence the sounds received in an environment (e.g., walking across marble floors versus on a shag carpet), SoundSpaces also models the acoustic material properties of major surfaces, capturing fine-grained acoustic properties like sound propagation through walls. SoundSpaces also allows rendering multiple concurrent sound sources placed at multiple locations in the environment. With SoundSpaces, researchers can train an agent to identify and move toward a sound source even if it’s behind a couch, for example, or to respond to sounds it has never heard before. The image below depicts sounds rendered at two different locations of an environment.
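At its core, rendering the sound heard at a receiver location amounts to convolving a dry source signal with the room impulse response (RIR) precomputed for that source-receiver pair, one RIR per ear for binaural output. The sketch below illustrates this idea in NumPy; the function and variable names are ours, not the SoundSpaces API:

```python
import numpy as np

def render_binaural(source, rir_left, rir_right):
    """Render what a listener hears at a receiver location by
    convolving a dry source signal with the left/right room impulse
    responses (RIRs) for that source-receiver pair. Multiple
    concurrent sources would simply be summed per channel."""
    left = np.convolve(source, rir_left)
    right = np.convolve(source, rir_right)
    return np.stack([left, right])  # shape: (2, len(source) + len(rir) - 1)

# Toy example: 1 s of a 440 Hz tone at 16 kHz, with tiny dummy RIRs.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
rir_l = np.array([1.0, 0.0, 0.3])   # direct path plus one reflection
rir_r = np.array([0.6, 0.2, 0.1])   # attenuated on the far ear
binaural = render_binaural(tone, rir_l, rir_r)
print(binaural.shape)  # (2, 16002)
```

In SoundSpaces, the RIRs themselves are the expensive part: they are precomputed with bidirectional path tracing over the scanned geometry and material properties, so training-time rendering reduces to fast convolutions like the one above.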
To help the AI community more easily reproduce and build on this work, we provide precomputed audio simulations to allow on-the-fly audio sensing in Matterport3D and Replica. By extending these AI Habitat-compatible 3D assets with our audio simulator, we enable researchers to take advantage of the efficient Habitat API and easily incorporate audio for AI agent training.
Several other new research projects from Facebook AI are now leveraging this audio-visual platform. One interesting direction we’ve found is that multimodal input enables strong results using self-supervised learning or RL:
See, Hear, Explore: Curiosity via Novel Audio-Visual Association: We created a new formulation of curiosity that rewards novel associations between different modalities (in our case, pixels and sounds). Using multimodal association for intrinsic motivation yields three times faster exploration in Habitat than a visual-only curiosity approach.
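As a rough intuition for this kind of reward (the paper uses a learned association model, not the count-based toy below), one can discretize each modality’s features and reward the agent for pairings it has rarely observed together:

```python
import numpy as np
from collections import Counter

class CrossModalCuriosity:
    """Toy intrinsic reward: the agent is rewarded for observing
    (visual, audio) pairings it has rarely seen together. We
    discretize each modality's features into coarse codes and count
    pair co-occurrences; rare pairs yield high reward."""
    def __init__(self, bins=8):
        self.bins = bins
        self.counts = Counter()

    def reward(self, visual_feat, audio_feat):
        # Discretize each feature vector into a coarse integer code.
        v = tuple(np.floor(visual_feat * self.bins).astype(int))
        a = tuple(np.floor(audio_feat * self.bins).astype(int))
        self.counts[(v, a)] += 1
        # Novel associations (low count) yield high reward.
        return 1.0 / np.sqrt(self.counts[(v, a)])

curiosity = CrossModalCuriosity()
r1 = curiosity.reward(np.array([0.2, 0.7]), np.array([0.9]))
r2 = curiosity.reward(np.array([0.2, 0.7]), np.array([0.9]))
print(r1, r2)  # first visit: 1.0; repeat visit: ~0.707
```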
VisualEchoes: Spatial Image Representation Learning through Echolocation: We also built a new self-supervised system that learns about the spatial properties of a scene from echoes and visual observations, improving performance on the spatial tasks of depth estimation and surface normal prediction. Watch the VisualEchoes video here.
Audio-Visual Waypoints for Navigation: In addition to self-supervision, we also created a novel RL-based technique for audio-visual navigation. Our agent predicts audio-visual waypoints to dynamically set intermediate goals, and builds an acoustic memory for a structured, spatially grounded record of what the agent has heard as it moves (see the figure below). Watch the video here.
When people are familiar with a particular place, such as their home or office, they can do much more than simply navigate these spaces. They can also intuitively answer questions such as whether the kitchen is next to the laundry room, or how many chairs are in the second-floor conference room. To build robust and capable AI assistants that can also perform these sorts of tasks well, we need to teach machines to explore, observe, and remember a space from their first-person points of view and then create a third-person (allocentric) top-down semantic map of that 3D environment.
Toward this goal, we’ve built and are now sharing Semantic MapNet, a new module for embodied AI agents, which uses a novel form of spatio-semantic memory to record the representations or “features” of objects observed in egocentric frames as it explores its unfamiliar surroundings. These semantic representations of 3D spaces can then provide a foundation for the system to accomplish a wide range of embodied AI tasks, including question answering and navigating to a particular location.
Semantic MapNet sets a new state of the art for predicting where particular objects, such as a sofa or a kitchen sink, are located on the pixel-level, top-down map that it creates. It outperforms previous approaches and baselines on mean-IoU, a widely used metric measuring the overlap between prediction and ground truth.
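For reference, mean-IoU over a predicted top-down semantic map can be computed as follows. This is a minimal sketch of the standard metric, not the evaluation code used in the paper:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union between a predicted top-down
    semantic map and the ground truth, averaged over classes that
    appear in either map."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:            # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# 0 = floor, 1 = table, 2 = chair on a tiny 2x4 map.
gt   = np.array([[0, 0, 1, 1],
                 [0, 2, 2, 1]])
pred = np.array([[0, 0, 1, 1],
                 [0, 2, 1, 1]])
print(mean_iou(pred, gt, num_classes=3))  # 0.75
```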
Previous embodied AI systems have typically used standard computer vision approaches to label the pixels the AI agent sees from its egocentric point of view. These labels, such as “table” or “kitchen,” are then projected into a 2D overhead map. But because the system is segmenting objects from an egocentric point of view before creating the map, any mistake at the egocentric object boundary results in “label splatter” in the map, where the map boundaries are imprecisely and incorrectly drawn. Another approach is to first create the 2D map and then segment the objects shown in it. This is inefficient, however, and discards significant visual information. The resulting map frequently misses small objects and underestimates the size of bigger ones.
Semantic MapNet’s novelty lies in building a spatio-semantic allocentric memory. It improves on these methods with an end-to-end learnable framework that extracts visual features from its egocentric observations and projects them to the appropriate locations in an allocentric spatial memory representation. It can then decode a top-down map of the environment with highly accurate semantic labels for the objects it has seen. This lets Semantic MapNet smooth out “feature splatter” and recognize and segment small objects that may not be visible from a bird’s-eye view. Because it projects features rather than labels, the decoder can reason about multiple observations of a given point and its surrounding area when producing the top-down semantic map.
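The geometric core of the projection step, mapping egocentric observations into allocentric grid cells given the agent’s pose, can be sketched as follows. Semantic MapNet projects learned visual features; this toy simply marks occupancy from a single horizontal depth scan line to show the geometry:

```python
import numpy as np

def project_to_topdown(depth, hfov_deg, pose_xy, heading,
                       grid_size=64, cell_m=0.1):
    """Project an egocentric depth scan line into an allocentric
    top-down grid, given the agent's 2D pose and heading. Each depth
    column defines a ray; its endpoint is converted to world
    coordinates and binned into a grid cell."""
    n = depth.shape[0]
    # Ray angle of each column, relative to the agent's heading.
    angles = heading + np.linspace(-1, 1, n) * np.radians(hfov_deg) / 2
    # World coordinates of each observed point.
    x = pose_xy[0] + depth * np.cos(angles)
    y = pose_xy[1] + depth * np.sin(angles)
    grid = np.zeros((grid_size, grid_size), dtype=bool)
    i = np.clip((x / cell_m).astype(int), 0, grid_size - 1)
    j = np.clip((y / cell_m).astype(int), 0, grid_size - 1)
    grid[i, j] = True
    return grid

# A wall 2 m ahead of an agent at (1 m, 1 m) facing the +x direction.
depth = np.full(32, 2.0)
grid = project_to_topdown(depth, hfov_deg=90, pose_xy=(1.0, 1.0), heading=0.0)
print(grid.sum())  # number of occupied cells marked on the map
```

In the full system, the same pose-based projection routes per-pixel feature vectors (rather than booleans) into the memory, and a learned decoder turns the accumulated features into semantic labels.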
These capabilities of building neural episodic memories and spatio-semantic representations are important for improved autonomous navigation, mobile manipulation, and egocentric personal AI assistants. The spatio-temporal semantic allocentric maps produced by Semantic MapNet can be leveraged for downstream embodied reasoning tasks, such as enabling agents to quickly follow natural language instructions in 3D environments, as in two recent works to be presented at ECCV 2020: one on leveraging vision-and-language corpora to improve navigation and another on language-guided navigation in continuous 3D environments.
One of the keys to recent progress in AI navigation has been the movement toward complex map-based architectures that capture both geometry and semantics. But even state-of-the-art approaches, such as our DD-PPO navigation algorithm, are limited to encoding what the agent actually sees in front of it. We wanted to create agents that are robust even when faced with more challenging situations, such as obstructions or unmapped areas. To push the frontier for navigation, we developed an occupancy anticipation approach, which received first place for the PointNav task at this year’s CVPR Habitat 2020 challenge. This new competition requires agents to adapt to noisy RGB-D sensors and noisy actuators and to operate without GPS or compass data.
To do this, we introduced a novel model that anticipates occupancy maps from normal field-of-view RGB-D observations, while aggregating its predictions over time in tight connection with learning a navigation policy. In contrast to existing methods that only map visible regions, the agent builds its spatial awareness more rapidly by inferring parts of the map that are not directly observed. For example, looking into a dining room, the agent anticipates that there is free space behind the table, or that the partially visible wall extends and opens to a hallway out of view (as shown in the graphic below). We achieved state-of-the-art performance for both exploration and navigation. Because our agent creates maps while anticipating areas not directly visible to it, the agent is faster and more efficient in exploration and navigation tasks.
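One piece of this pipeline, aggregating per-step anticipated occupancy patches into a persistent global map, can be illustrated with a simple running-maximum registration. This is a toy stand-in for the learned temporal aggregation, with names of our own choosing:

```python
import numpy as np

def aggregate(global_map, local_pred, top_left):
    """Register a local anticipated-occupancy patch (per-cell
    confidences in [0, 1]) into the global map at the agent's current
    position, keeping the running maximum confidence per cell."""
    r, c = top_left
    h, w = local_pred.shape
    global_map[r:r+h, c:c+w] = np.maximum(global_map[r:r+h, c:c+w],
                                          local_pred)
    return global_map

world = np.zeros((4, 4))
step1 = np.array([[0.9, 0.2], [0.4, 0.8]])   # anticipated at time t
step2 = np.array([[0.1, 0.5], [0.6, 0.3]])   # anticipated at time t+1
aggregate(world, step1, (1, 1))
aggregate(world, step2, (1, 1))
print(world[1:3, 1:3])
# [[0.9 0.5]
#  [0.6 0.8]]
```

Keeping the per-cell maximum means a confident anticipation made at one step is never erased by a later, less certain view of the same cell; the actual system learns this fusion jointly with the navigation policy.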
We outperform the best competing method using only a third as many agent movements, and we attain 30 percent better map accuracy for the same number of movements. With occupancy anticipation and a learned visual odometry model, we achieved a 19.2 percent higher SPL (success rate weighted by path length) than DD-PPO, which has a map-free navigation architecture and assumes perfect odometry.
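The SPL metric has a compact definition: the per-episode success flag weighted by the ratio of shortest-path length to the length of the path actually taken, averaged over episodes. A minimal sketch:

```python
def spl(successes, shortest, taken):
    """Success weighted by Path Length (SPL): for each episode i,
    score S_i * l_i / max(p_i, l_i), where S_i is the success flag
    (0 or 1), l_i the shortest-path length to the goal, and p_i the
    length of the path the agent actually took; average over episodes."""
    scores = [s * l / max(p, l) for s, l, p in zip(successes, shortest, taken)]
    return sum(scores) / len(scores)

# Three episodes: optimal success, inefficient success, and a failure.
print(spl(successes=[1, 1, 0],
          shortest=[5.0, 5.0, 5.0],
          taken=[5.0, 10.0, 7.0]))
# (1.0 + 0.5 + 0.0) / 3 = 0.5
```

A failed episode scores zero regardless of path length, so SPL rewards agents that succeed efficiently rather than those that merely succeed eventually.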
In addition to the work above, we’ve published the following research to help carry embodied AI forward, much of which we will also present at ECCV this year:
To further advance navigation for challenging scenes in different but complementary ways, we created a new algorithm for the room navigation task in Habitat. Like humans, it builds beliefs by predicting top-down maps and uses these beliefs (e.g., a bathroom is typically close to a drawing room or bedroom) to efficiently navigate to rooms. As a step between point-goal and object-goal navigation, this will be useful for training agents that must first reach a room in order to collect an object (e.g., a knife will most likely be in a kitchen).
We’ve also improved the way virtual robots follow instructions in simulation, by utilizing disembodied large-scale image captioning data sets (e.g., go down the hallway, stop at the brown sofa).
We’ve systematically tested the four key paradigms for embodied visual exploration, to create a new benchmark for the field and understand what intrinsic rewards best promote fast mapping in unfamiliar environments. See the study and code here.
Moving beyond purely navigational tasks, we are investigating how embodied agents operating in human spaces can more effectively interact with the environment. To this end, we introduced an RL approach for exploration for interaction, whereby an embodied agent autonomously discovers the affordance landscape of a new unmapped 3D environment (such as an unfamiliar kitchen).
We’re making rapid progress in training agents to accomplish a wide range of challenging tasks in photorealistic 3D simulators featuring complex physical dynamics. We’ve come a long way, going well beyond simple gamelike settings like mastering Space Invaders a few years ago.
These efforts are part of Facebook AI’s long-term goal of building intelligent AI systems that can intuitively think, plan, and reason about the real world, where even routine conditions are highly complex and unpredictable. Combining our embodied AI systems with breakthrough 3D deep learning tools and new ways to reason about 3D objects from 2D images, for instance, will further improve understanding of objects and places.
By pursuing these related research agendas and sharing our work with the wider AI community, we hope to accelerate progress in building embodied AI systems and AI assistants that can help people accomplish a wide range of complex tasks in the physical world.
We’d like to acknowledge the contributions to SoundSpaces, Semantic MapNet, and Occupancy Anticipation from our collaborators at University of Texas at Austin, University of Illinois, Georgia Tech, and Oregon State.