January 21, 2020
Written by Erik Wijmans, Abhishek Kadian
The AI community has a long-term goal of building intelligent machines that interact effectively with the physical world, and a key challenge is teaching these systems to navigate through complex, unfamiliar real-world environments to reach a specified destination — without a preprovided map. We are announcing today that Facebook AI has created a new large-scale distributed reinforcement learning (RL) algorithm called DD-PPO, which has effectively solved the task of point-goal navigation using only an RGB-D camera, GPS, and compass data. Agents trained with DD-PPO (which stands for decentralized distributed proximal policy optimization) achieve nearly 100 percent success in a variety of virtual environments, such as houses and office buildings. We have also successfully tested our model with tasks in real-world physical settings using a LoCoBot and Facebook AI’s PyRobot platform.
An unfortunate fact about maps is that they become outdated the moment they are created. Most real-world environments evolve — buildings and structures change, objects are moved around, and people and pets are in constant flux. By learning to navigate without a map, DD-PPO-trained agents will accelerate the creation of new AI applications for the physical world.
Previous systems reached a 92 percent success rate on these tasks, but even failing 1 out of 100 times is not acceptable in the physical world, where a robot agent might damage itself or its surroundings by making an error. DD-PPO-trained agents reach their goal 99.9 percent of the time. Perhaps even more impressive, they do so with near-maximal efficiency, choosing a path that comes within 3 percent (on average) of matching the shortest possible route from the starting point to the goal. It is worth stressing how uncompromising this task is. There is no scope for mistakes of any kind — no wrong turn at a crossroads, no backtracking from a dead end, no exploration or deviation of any kind from the most direct path. We believe that the agent learns to exploit the statistical regularities in the floor plans of real indoor environments (apartments, houses, and offices) that are also present in our data sets. This improved performance is powered by a new, more effective system for distributed training (DD-PPO), along with the state-of-the-art speed and fidelity of Facebook AI’s open source AI Habitat platform.
Navigation is essential for creating AI agents and assistants that help people in the physical world, from robots that can retrieve an object from a desk upstairs, to systems that help people with visual impairments, to AI-powered assistants that present relevant information to people wearing augmented reality glasses. We hope to build on DD-PPO’s success by creating systems that accomplish point-goal navigation with only camera input — and no compass or GPS data. This will help researchers build agents that work in common settings, such as inside office buildings or laboratories, where these additional data points aren’t available. In addition to open-sourcing the DD-PPO code and trained models, we are creating a new challenge to perform point-goal navigation using only RGB-D input.
Recent advances in deep RL have given rise to systems that can outperform human experts at a variety of games. These advances rely on a large volume of training samples, making them impractical without large-scale, distributed parallelization.
Several works have proposed systems for distributed RL. At a high level, these works utilize two notable components: “rollout” workers that collect experience and a parameter server that optimizes the model.
In our accompanying research paper, which is being presented at ICLR 2020, we argue that this paradigm — a single parameter server and thousands of (typically CPU) workers — may be fundamentally incompatible with the needs of modern computer vision and robotics communities. Specifically, over the last few years, a large number of vision and robotics works have proposed training virtual robots (commonly called embodied agents) in rich 3D simulators, such as Facebook AI’s open source AI Habitat. Unlike Gym or Atari, 3D simulators require GPU acceleration, which greatly limits the number of workers (on the order of 2⁵ to 2⁸ rather than 2¹² to 2¹⁵). The desired agents operate from high-dimensional inputs (pixels) and use deep networks, such as ResNet50, which strain the parameter server. Thus, existing distributed RL architectures do not scale, and there is a need to develop a new distributed architecture.
We propose a simple, synchronous, distributed RL method that scales well. We call this method decentralized distributed proximal policy optimization, as it is decentralized (has no parameter server) and distributed (runs across many different machines), and we use it to scale proximal policy optimization, a previously developed technique (Schulman et al., 2017). In DD-PPO, each worker alternates between collecting experience in a resource-intensive, GPU-accelerated simulated environment and then optimizing the model. This distribution is synchronous — there is an explicit communication stage in which workers synchronize their updates to the model.
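The key property of this update — no parameter server; every worker applies the identical averaged gradient, so all model copies stay in sync — can be sketched in a few lines. This is a toy model using plain Python lists in place of GPU tensors; the real system uses PyTorch's collective communication primitives, and the function names here are ours, not the DD-PPO API.

```python
def allreduce_mean(worker_grads):
    """Average gradients across workers with no central parameter server
    (stand-in for a collective allreduce over GPUs)."""
    n = len(worker_grads)
    return [sum(g[i] for g in worker_grads) / n
            for i in range(len(worker_grads[0]))]

def ddppo_step(worker_params, worker_grads, lr=0.1):
    """Every worker applies the same averaged gradient, so all model
    copies remain identical after each synchronous update."""
    avg = allreduce_mean(worker_grads)
    return [[p - lr * g for p, g in zip(params, avg)]
            for params in worker_params]

# Two workers start with identical parameters but different local gradients.
params = [[1.0, 2.0], [1.0, 2.0]]
grads = [[0.2, 0.4], [0.6, 0.0]]
new_params = ddppo_step(params, grads)
# Averaged gradient is [0.4, 0.2]; both workers end with the same parameters.
```

Because the averaged gradient is identical on every worker, the model replicas never diverge, which is what removes the need for a central server.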
The variability in experience collection runtime presents a challenge to using this method in RL. In supervised learning, all gradient computations take approximately the same time. In RL, some resource-intensive environments can take significantly longer to simulate. This introduces significant synchronization overhead, as every worker must wait for the slowest to finish collecting experience. To address this, we introduced a preemption threshold: the rollout collection stage of these stragglers is forced to end early once a fixed percentage of the other workers (we find 60 percent to work well) have finished collecting their rollouts, dramatically improving scaling. Our system weights all workers' contributions to the loss equally and never cuts a rollout below one-fourth of the maximum length, ensuring that all environments contribute to learning.
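The preemption rule can be illustrated with a toy model in which each worker advances its simulator at a fixed per-step cost (the real system preempts stragglers on a wall-clock signal during collection; the function and parameter names here are illustrative, not from the DD-PPO code).

```python
import math

def rollout_steps(step_costs, max_steps=128, frac=0.6):
    """Steps each worker collects under the preemption rule, assuming a
    fixed per-step simulation cost per worker (a toy model)."""
    n = len(step_costs)
    finish = sorted(c * max_steps for c in step_costs)
    k = math.ceil(frac * n)       # workers that must finish a full rollout
    t_preempt = finish[k - 1]     # time at which the preemption signal fires
    min_steps = max_steps // 4    # floor so every environment contributes
    return [min(max_steps, max(min_steps, int(t_preempt // c)))
            for c in step_costs]

# Four workers; the last simulates a very complex scene and is 10x slower.
steps = rollout_steps([1.0, 1.0, 1.2, 10.0], max_steps=128, frac=0.6)
# The fast workers collect full 128-step rollouts; the straggler is
# preempted but still contributes 32 steps (one-fourth of the maximum).
```

Without preemption, every update would wait roughly 10x longer for the slow worker; with it, the straggler costs only a modest reduction in its contributed experience.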
We characterized the scaling of DD-PPO by the steps of experience per second with N workers relative to one worker. We considered two different workloads: One where simulation time is roughly equivalent for all environments and another where simulation time can vary dramatically due to large differences in environment complexity.
Under both workloads, we found that DD-PPO exhibits near-linear scaling — achieving a speedup of 107x on 128 GPUs over a serial implementation.
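As a back-of-envelope check of what "near-linear" means here, the reported numbers imply a per-GPU scaling efficiency of roughly 84 percent:

```python
# Scaling efficiency implied by the reported results: a 107x speedup on
# 128 GPUs means each GPU delivers about 84 percent of its ideal share.
speedup, gpus = 107, 128
efficiency = speedup / gpus  # 107 / 128 = 0.8359375
```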
We trained and evaluated DD-PPO using our AI Habitat platform. Habitat is a modular framework with a highly performant and stable simulator, making it ideal for simulating billions of steps of experience. Habitat runs at 10K frames/second (multiprocess) and can work with a wide variety of data sets, including Replica, the most realistic virtual environment for AI research currently available. We experimented with Replica as well as several hundred scenes from the Gibson data set.
In point-goal navigation, an agent is initialized at a random starting position and orientation in a new environment and asked to navigate to target coordinates specified relative to the agent’s position. No map is available, and the agent must navigate using only its sensors — GPS+Compass (to provide its current position and orientation relative to start) and either an RGB-D or RGB camera.
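The GPS+Compass reading can be thought of as the goal expressed in the agent's egocentric frame: a distance and an angle relative to its current heading. A minimal sketch of that computation (the function and variable names are ours for illustration, not Habitat's sensor API):

```python
import math

def relative_goal(agent_xy, heading, goal_xy):
    """Distance and egocentric angle to the goal — the kind of reading a
    GPS+Compass sensor supplies. (Illustrative, not Habitat's actual API.)"""
    dx = goal_xy[0] - agent_xy[0]
    dy = goal_xy[1] - agent_xy[1]
    rho = math.hypot(dx, dy)                         # straight-line distance
    phi = math.atan2(dy, dx) - heading               # angle in agent's frame
    phi = (phi + math.pi) % (2 * math.pi) - math.pi  # wrap to (-pi, pi]
    return rho, phi

# Agent at the origin with heading 0, goal at (3, 4): distance 5,
# bearing atan2(4, 3) radians to one side.
rho, phi = relative_goal((0.0, 0.0), 0.0, (3.0, 4.0))
```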
We used DD-PPO to train an agent for point-goal navigation for 2.5 billion steps (the equivalent of 80 years of human experience). This represented more than six months of GPU-time training, but we completed it in less than three days of wall-clock time with 64 GPUs. As a comparison, previous methods, such as that developed by Savva et al., would require more than a month of wall-clock time.
Furthermore, our results show that the performance of an agent (with RGB-D and GPS+compass sensors) does not saturate before 1 billion steps, suggesting that previous studies were incomplete by one to two orders of magnitude. Fortuitously, error vs. computation exhibits a power-law-like distribution, with 90 percent of peak performance obtained relatively early (100 million steps) and with relatively few computing resources (in one day with 8 GPUs).
Reaching billions of steps of experience not only sets the state of the art on the Habitat Autonomous Navigation Challenge 2019 but also essentially solves the task. It achieves a success rate of 99.9 percent and a score of 96.9 percent on the SPL efficiency metric. (SPL refers to success rate weighted by normalized inverse path length.)
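SPL (Anderson et al., 2018) is simple to write down; a minimal sketch, with a helper name of our choosing rather than anything from the Habitat API:

```python
def spl(successes, shortest_lengths, agent_lengths):
    """Success weighted by normalized inverse Path Length: an episode counts
    only if it succeeds, discounted by how much longer the agent's path was
    than the shortest possible one."""
    n = len(successes)
    return sum(s * l / max(p, l)
               for s, l, p in zip(successes, shortest_lengths,
                                  agent_lengths)) / n

# Two episodes: a success along a near-shortest path, and a failure.
# Episode 1 contributes 10 / 10.5; episode 2 contributes 0.
score = spl([1, 0], [10.0, 8.0], [10.5, 20.0])
```

Note that a failed episode scores zero no matter how efficient its path was, which is why a 96.9 percent SPL alongside a 99.9 percent success rate indicates near-optimal paths.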
Our agent also shows the ability to make intelligent choices, such as picking the correct fork in the road. In the example below, the agent’s compass indicates its goal is straight ahead. But it sees there are walls ahead and on the left, so it correctly determines that right is the optimal direction. When the agent makes an incorrect decision, it’s also able to recognize its error and then backtrack and choose the correct path.
In the example below, the agent makes a wrong decision and does not enter the bedroom door. Once the agent realizes this error, it quickly backtracks, enters the correct room, and successfully reaches the goal.
Our hypothesis is that the agent achieves this by learning to exploit the structural regularities in layouts of real indoor environments. One (admittedly imperfect) way to test this is by training a “blind” agent that is equipped with only a GPS+Compass sensor but no cameras. The agent is able to handle short-range navigation. But on longer trajectories, it performs very poorly, with a success rate of 50 percent vs. 99 percent for an RGB-D-equipped agent at 20-25m navigation. Thus, we’ve concluded that structural regularities partly explain performance for short-range navigation. For long-range navigation, it is harder to exploit those patterns in the layout of offices and homes, so the RGB-D agent is more dependent on its depth sensor to reach its goal.
We also investigated performance on the significantly more challenging task of navigation from RGB alone, without GPS+compass data. At 100 million steps, the agent achieves a 0 percent success rate. By training to 2.5 billion steps, we make some progress and achieve a 16 percent success rate. While this is a substantial improvement, the task remains an open frontier for research in embodied AI.
There is much work left to do to create machines that learn to navigate through challenging real-world settings in order to accomplish complex tasks. We look forward to exploring new solutions to RGB-only point-goal navigation, which is important because compass and GPS data can be noisy or simply unavailable in indoor spaces. We will also apply DD-PPO-trained models to different tasks.
Our models are able to rapidly learn new tasks (outperforming ImageNet pretrained CNNs) and can be utilized as near-perfect neural point-goal controllers — a universal resource for other high-level navigation tasks, such as maximizing the agent’s distance from its starting point. We hope to build on this work to create systems that perform more semantic, higher-level tasks such as ObjectNav (“Go to a chair”), instruction following (“Go out of the room, turn left, go down the hallway and up the stairs, and stop at the desk”), Embodied Question Answering (“Is my laptop on my desk?”), and eventually manipulating objects in response to requests like “Bring me my laptop from my desk.” We’ll also continue our work bringing these agents from simulation to real-world robotic agents, such as LoCoBots using Facebook AI’s PyRobot framework.
The AI community will make faster progress toward these research goals if we choose to work openly and share our advances. In addition to publishing details about DD-PPO, we’re preparing the next annual Habitat challenge for the Conference on Computer Vision and Pattern Recognition. By building an open source research community around this work, we can develop new ways for AI to help people accomplish tasks in the physical world.