From a robot asked to “grab my phone from the desk upstairs” to a device that helps its visually impaired wearer navigate an unfamiliar subway system, the next generation of AI-powered assistants will need to demonstrate a broad range of abilities. Many researchers believe the most effective way to develop these skills is to focus on embodied AI, which uses interactive environments to ground systems’ training in the real world, rather than relying on static data sets. To accelerate progress in this space, we’re sharing AI Habitat, a new simulation platform created by Facebook AI that’s designed to train embodied agents (such as virtual robots) in photo-realistic 3D environments. Our goal in sharing AI Habitat is to provide the most universal simulator to date for embodied research, with an open, modular design that’s both powerful and flexible enough to bring reproducibility and standardized benchmarks to this subfield.
To illustrate the benefits of this new platform, we’re also sharing Replica, a data set of hyperrealistic 3D reconstructions of a staged apartment, retail store, and other indoor spaces that were generated by a group of scientists within Facebook Reality Labs (FRL). To our knowledge, this data set contains the most photo-realistic 3D reconstructions of environments available. This level of detail narrows the training gap between virtual and physical spaces, which we believe is important for transferring the skills learned in simulation to the real world.
While AI Habitat can run Replica’s state-of-the-art reconstructions right now, the platform also works with existing 3D assets created for embodied research, such as the Gibson and Matterport3D data sets. Compatibility and flexibility are the guiding principles behind the platform’s modular software stack, whose layers can be modified separately or swapped out entirely. This approach accommodates a wider range of agents, training techniques, and environments than other simulators, which have typically been built around specific parameters, such as a given data set or type of experiment. We incorporated direct feedback from the research community to develop this degree of flexibility, and also pushed the state of the art in training speeds, making the simulator able to render environments orders of magnitude faster than previous simulators.
AI Habitat is available now, and the platform has already been field-tested — we recently hosted an autonomous navigation challenge that ran on the platform and will award Google Cloud credits to the winning teams at the Habitat Embodied Agents workshop at CVPR 2019. And though AI Habitat is built for (and by) embodied AI researchers, it’s also part of Facebook AI’s ongoing effort to create systems that are less reliant on large annotated data sets used for supervised training. As more researchers adopt the platform, we can collectively develop embodied AI techniques more quickly, as well as realize the larger benefits of replacing yesterday’s training data sets with active environments that better reflect the world we’re preparing machine assistants to operate in.
Facebook AI has explored the potential of embodied AI for years, including creating agents that communicate with each other to find their way through simulated NYC streets and agents that navigate virtual indoor environments in order to answer questions. These efforts shared the common goal of developing versatile, problem-solving AI that leverages progress in traditionally distinct research areas, such as using natural language processing (NLP) to communicate with humans or agents, computer vision (CV) to perceive simulated environments, and reinforcement learning (RL) techniques that power the decision-making to navigate real-world spaces. But our work mirrored the larger state of embodied AI research — with no standard platform available to easily run experiments and measure results against others, new projects often required starting from scratch. Given the resource-intensive nature of building and rendering simulated environments, this made progress slower compared with subfields whose standard benchmarks — such as ImageNet and COCO — enable collaboration and rapid iteration.
With AI Habitat, we wanted to retain the simulation-related benefits that past projects demonstrated, such as speeding up experimentation and RL-based training, and apply them to a more widely compatible and increasingly realistic platform. In February 2018, Facebook AI hosted a workshop about the lack of standardization in embodied AI. Researchers from more than a dozen leading AI organizations attended, and their wide-ranging input (which included high-level goals as well as discussions related to standardizing the use of different environments, tasks, and physics engines) helped clarify the features that would ensure AI Habitat’s broad utility.
The structure that emerged over more than a year of development is geared toward flexibility. AI Habitat consists of a stack of three modular layers, each of which can be configured or even replaced to work with different kinds of agents, training techniques, evaluation protocols, and environments. Separating these layers differentiates the platform from other simulators, whose design can make it difficult to decouple parameters in order to reuse assets or compare results.
This emphasis on flexibility is also reflected in the simulation engine, called Habitat-Sim, which sits at the base of the stack and includes built-in support for existing 3D environment data sets, such as Gibson, Matterport3D, and FRL’s Replica reconstructions. While our long-term aim is to move toward increasingly photo-realistic simulations, establishing simulation-based benchmarks will require widespread compatibility. By using a single hierarchical scene graph to represent all supported 3D environment data sets, Habitat-Sim can abstract the details of specific data sets, applying them consistently across simulations.
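To make the idea of a single hierarchical scene graph more concrete, here is a minimal sketch in Python. It is not Habitat-Sim's implementation: the `SceneNode` class, its fields, and the example scene are hypothetical, and only translations are composed (a real engine would also track rotation and scale). The point is that once every data set is loaded into the same node type, downstream code can traverse any scene the same way.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class SceneNode:
    """One node in a hierarchical scene graph (hypothetical, for illustration only)."""

    name: str
    local_translation: Vec3 = (0.0, 0.0, 0.0)  # offset relative to the parent node
    children: List["SceneNode"] = field(default_factory=list)

    def add_child(self, child: "SceneNode") -> "SceneNode":
        """Attach a child node and return it so calls can be chained."""
        self.children.append(child)
        return child

    def walk(self, parent_world: Vec3 = (0.0, 0.0, 0.0)) -> Iterator[Tuple[str, Vec3]]:
        """Yield (name, world position) for this node and its descendants, depth first."""
        world = tuple(p + l for p, l in zip(parent_world, self.local_translation))
        yield self.name, world
        for child in self.children:
            yield from child.walk(world)


def build_example_scene() -> SceneNode:
    """Build a tiny made-up scene: scene -> office -> desk -> book."""
    root = SceneNode("scene")
    office = root.add_child(SceneNode("office", (2.0, 0.0, 1.0)))
    desk = office.add_child(SceneNode("desk", (0.5, 0.0, 0.5)))
    desk.add_child(SceneNode("book", (0.25, 0.75, 0.0)))
    return root


if __name__ == "__main__":
    for name, position in build_example_scene().walk():
        print(name, position)
```

In this sketch, a loader for any particular data set would only need to emit `SceneNode` objects; the traversal code would not change.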
In addition to being widely compatible with existing 3D data sets, the simulator also renders these assets quickly: In benchmark tests, Habitat-Sim with multiple processes renders detailed scenes at 10,000 frames per second (FPS) on a single GPU, compared with a typical rate of 100 FPS on other simulators. That kind of speed boost is relevant to vision-based embodied learning, since increasing the number of frames that an agent experiences in a given period can directly increase training efficiency.
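A quick back-of-the-envelope calculation shows why this difference matters. The sketch below uses the two frame rates quoted above and a hypothetical budget of one billion frames. It counts rendering time only and ignores the cost of running and updating the agent, so the results are rough lower bounds, not measured training times.

```python
def wall_clock_hours(num_frames: int, frames_per_second: float) -> float:
    """Hours needed to render num_frames at a constant frame rate."""
    return num_frames / frames_per_second / 3600.0


FRAME_BUDGET = 1_000_000_000  # hypothetical number of frames of experience
FAST_FPS = 10_000             # Habitat-Sim figure quoted in the text
TYPICAL_FPS = 100             # typical rate for other simulators quoted in the text

hours_fast = wall_clock_hours(FRAME_BUDGET, FAST_FPS)
hours_typical = wall_clock_hours(FRAME_BUDGET, TYPICAL_FPS)

if __name__ == "__main__":
    print(f"At {FAST_FPS} FPS: {hours_fast:.1f} hours")
    print(f"At {TYPICAL_FPS} FPS: {hours_typical / 24:.1f} days")
```

Under these assumptions, the same frame budget takes a little over a day of rendering at 10,000 FPS and close to four months at 100 FPS.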
The second layer in AI Habitat’s software stack is Habitat-API, a high-level library for defining tasks such as visual navigation and question answering. The API incorporates the additional data and configurations that such tasks require while still simplifying and standardizing the training and evaluation of embodied agents. And decoupling the API from the simulator will eventually enable users to replace either layer once the community develops newer components.
The platform’s third and final layer is its most open-ended. This is the concrete embodied task that systems are being asked to learn through simulation. It’s also where users specify training and evaluation parameters, such as how difficulty might ramp across multiple runs and what metrics to focus on. For more details about the platform’s structure and features, read our white paper.
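The three-layer separation described above can be sketched in a few lines of Python. Everything here is hypothetical and deliberately tiny: `CorridorSim`, `ReachGoalTask`, `Experiment`, and `greedy_policy` are illustrative names, not Habitat-Sim or Habitat-API classes, and the "world" is a one-dimensional corridor. The sketch only shows how a simulator, a task definition, and experiment parameters can be decoupled so that each one can be swapped independently.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol

Observation = Dict[str, float]
Policy = Callable[[Observation, float], str]


class Simulator(Protocol):
    """Layer 1: anything that can reset a scene and apply an action."""

    def reset(self) -> Observation: ...
    def step(self, action: str) -> Observation: ...


class CorridorSim:
    """Toy 1-D stand-in for a simulation engine; the agent walks along a line."""

    def __init__(self, step_size: float = 0.25) -> None:
        self.step_size = step_size
        self.position = 0.0

    def reset(self) -> Observation:
        self.position = 0.0
        return {"position": self.position}

    def step(self, action: str) -> Observation:
        if action == "forward":
            self.position += self.step_size
        return {"position": self.position}


@dataclass
class ReachGoalTask:
    """Layer 2: a task defined on top of any object that satisfies Simulator."""

    sim: Simulator
    goal: float
    tolerance: float = 0.2
    max_steps: int = 50

    def run_episode(self, policy: Policy) -> Dict[str, float]:
        obs = self.sim.reset()
        steps = 0
        while steps < self.max_steps:
            action = policy(obs, self.goal)
            if action == "stop":
                break
            obs = self.sim.step(action)
            steps += 1
        success = abs(obs["position"] - self.goal) <= self.tolerance
        return {"success": float(success), "steps": float(steps)}


@dataclass
class Experiment:
    """Layer 3: concrete settings, such as a difficulty ramp and the metric to report."""

    goals: List[float]
    metric: str = "success"


def greedy_policy(obs: Observation, goal: float) -> str:
    """A trivial agent: walk forward until the goal position is reached."""
    return "stop" if obs["position"] >= goal else "forward"


def run_experiment(sim: Simulator, experiment: Experiment, policy: Policy) -> List[float]:
    """Run one episode per goal and collect the chosen metric."""
    return [
        ReachGoalTask(sim=sim, goal=goal).run_episode(policy)[experiment.metric]
        for goal in experiment.goals
    ]


if __name__ == "__main__":
    ramp = Experiment(goals=[1.0, 2.0, 4.0], metric="steps")
    print(run_experiment(CorridorSim(), ramp, greedy_policy))
```

Because `ReachGoalTask` depends only on the `Simulator` protocol, a different engine could be dropped in without touching the task or the experiment settings, which is the kind of decoupling the stack is meant to provide.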
Though we built AI Habitat to work with existing data sets of 3D environments, the future of this platform — and of embodied AI research more broadly — is in simulated environments that are increasingly indistinguishable from real life. The Replica data set created by FRL provides an example of the hyperdetailed assets that will elevate embodied simulations, as well as the value of leveraging a single resource for different kinds of research.
The data set consists of scans of 18 scenes that range in size, from an office conference room to a two-floor house. The surfaces of these digital models have realistic textures, including difficult-to-reproduce glass and mirror surfaces. In addition to reconstructing the geometry and texture of indoor spaces, we also densely and painstakingly annotated the environments with semantic labels, such as “window” and “stairs,” including labels for individual objects, such as “book” or “plant.” Such annotations are crucial for advancing research in embodied AI as well as more traditional approaches to CV.
To create this data set, FRL researchers used proprietary camera technology and a spatial AI technique that’s based on the simultaneous localization and mapping (SLAM) approaches employed by roboticists to map environments as robots move through them. Replica also captures the details in the raw video, reconstructing dense 3D meshes with both high-resolution and high dynamic range textures.
The data used to generate Replica scans was anonymized to remove any personal details (such as family photos) that could identify an individual. The overall reconstruction process was meticulous, with researchers manually filling in the small holes that are inevitably missed during scanning and using a 3D paint tool to apply annotations directly onto meshes. But even this initial collection of reconstructions spans a variety of indoor spaces and is, to our knowledge, more realistic than any related assets that are publicly available. It also includes a wide range of object instances that are relevant to learning and testing machine learning (ML) tasks. Running Replica’s assets on the AI Habitat platform reveals how versatile active environments are to the research community, not just for embodied AI but also for running experiments related to CV and ML. More details about these reconstructions and their compatibility with AI Habitat can be found in this white paper.
Longer term, Facebook believes that pairing photo-realistic virtual environments with equally realistic avatars will enable true social presence — the feeling that you are together in the same room with someone else and that you are able to communicate your ideas and emotions effortlessly. We believe this is the future of connection. And Replica also demonstrates how researchers across Facebook share resources to advance multiple goals at once.
To demonstrate the utility of AI Habitat’s modular approach and emphasis on 3D photo-realism, we held the Habitat Challenge, a competition focused on one specific task: goal-directed visual navigation. Unlike traditional challenges, in which people upload predictions for a task tied to a given benchmark such as ImageNet or VQA, this one required participants to upload code. The code was run on new environments that their agents had not seen before, and we’re excited to announce the top-performing teams: Team Arnold (a group of researchers from CMU) and Team Mid-Level Vision (a group of researchers from Berkeley and Stanford). More details will be available on the Habitat Challenge site and more analysis will be presented at the Habitat workshop.
The information that we shared for the challenge functioned as the task layer of AI Habitat’s three-layer stack. As researchers use the platform for their own experiments, they’ll create and apply new task parameters to complete the stack. For our competition, agents were simulated as cylindrical, 1.5-meter-tall robots that could turn left or right in 10-degree increments and move forward 0.25 meter at a time. Stopping within 0.2 meter of the goal was considered a successful run, with more detailed evaluation based on the total length of the agent’s path.
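The action space and success criterion described above are simple enough to sketch directly. The code below is an illustration under stated assumptions, not the challenge's evaluation code: the `Pose` class and function names are hypothetical, the agent moves on a flat plane with no obstacles or collisions, and "within 0.2 meter" is interpreted as straight-line distance from the agent's final position to the goal.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Tuple

TURN_DEGREES = 10.0    # turn increment from the challenge description
FORWARD_METERS = 0.25  # forward step from the challenge description
SUCCESS_RADIUS = 0.2   # stopping distance that counts as success


@dataclass(frozen=True)
class Pose:
    """Agent position on a flat plane; heading 0 faces +x, counterclockwise is positive."""

    x: float = 0.0
    y: float = 0.0
    heading_deg: float = 0.0


def apply_action(pose: Pose, action: str) -> Pose:
    """Return the pose after one discrete action (no collision handling in this sketch)."""
    if action == "turn_left":
        return Pose(pose.x, pose.y, (pose.heading_deg + TURN_DEGREES) % 360.0)
    if action == "turn_right":
        return Pose(pose.x, pose.y, (pose.heading_deg - TURN_DEGREES) % 360.0)
    if action == "move_forward":
        heading = math.radians(pose.heading_deg)
        return Pose(
            pose.x + FORWARD_METERS * math.cos(heading),
            pose.y + FORWARD_METERS * math.sin(heading),
            pose.heading_deg,
        )
    raise ValueError(f"unknown action: {action}")


def rollout(actions: Iterable[str], start: Pose = Pose()) -> Tuple[Pose, float]:
    """Apply a sequence of actions; return the final pose and total path length in meters."""
    pose, path_length = start, 0.0
    for action in actions:
        pose = apply_action(pose, action)
        if action == "move_forward":
            path_length += FORWARD_METERS
    return pose, path_length


def is_success(final_pose: Pose, goal_xy: Tuple[float, float]) -> bool:
    """True if the agent stopped within SUCCESS_RADIUS of the goal."""
    distance = math.hypot(final_pose.x - goal_xy[0], final_pose.y - goal_xy[1])
    return distance <= SUCCESS_RADIUS


if __name__ == "__main__":
    final_pose, length = rollout(["move_forward"] * 4)
    print(final_pose, length, is_success(final_pose, (1.1, 0.0)))
```

A path-length-based score could then compare `path_length` with the shortest possible route to the goal, although the exact formula used in the challenge is not given here.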
The challenge was broken up into two tracks for the same navigation task, with one track providing participants with purely visual RGB sensor data and the other adding depth information (or RGBD data). The environments that agents navigated came from the Gibson data set, whose 3D reconstructions of indoor spaces are based on 572 real buildings. Though the Replica data set wasn’t available at the time of the challenge, the Gibson scans allowed us to demonstrate that our simulation is compatible with existing assets.
The input we received from challenge participants indicates that AI Habitat is already delivering on many of its goals. These include enabling rapid initial adoption (thanks to its API layer) and providing a task and a simulated environment that work for embodied AI experiments, as well as a more conventional benchmark for navigation algorithms.
And though they weren’t eligible to win the competition, Facebook AI research intern Erik Wijmans (a PhD student at Georgia Tech) and AI Resident Bhavana Jain used the challenge data set and Habitat to conduct internal experiments at an extremely large scale, training agents with more than a billion frames of experience. Assuming an average person is capable of taking an “action” (e.g., move forward 0.25 meter or turn 10 degrees) once every second, the agents trained by Wijmans and Jain learned from the equivalent of 31.7 years of experience. These state-of-the-art agents outperformed all the official entries in the challenge, indicating the simulation platform’s value both for the wider community and for embodied research at Facebook.
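The 31.7-year figure follows directly from the stated assumption of one action per second. The short sketch below reproduces it, assuming exactly one billion frames (the text says "more than a billion") and a 365-day year; the function name is illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds in a 365-day year


def frames_to_years(num_frames: int, seconds_per_action: float = 1.0) -> float:
    """Convert frames of experience to years, at one action (frame) per seconds_per_action."""
    return num_frames * seconds_per_action / SECONDS_PER_YEAR


if __name__ == "__main__":
    years = frames_to_years(1_000_000_000)
    print(f"{years:.1f} years of experience")
```

Since the agents were trained on more than a billion frames, 31.7 years is best read as a floor on the equivalent experience.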
Though AI Habitat and Replica are already powerful open resources, these releases are part of a larger commitment to research that’s grounded in physical environments. This is work that we’re pursuing through advanced simulations, as well as with robots that learn almost entirely through nonsimulated, physical training. Traditional AI training methods have a head start on embodied techniques that’s measured in years, if not decades. That’s why we’re going to keep pushing the capabilities of AI Habitat, including a range of planned upgrades to the simulator, such as incorporating larger and more complex tasks and experiments down the line. For example, the engine’s resource-efficient rendering techniques are able to produce multiple channels of visual information to multiple agents at once, paving the way for simulations filled with concurrent agents. And while Replica has demonstrated that AI Habitat can accurately simulate how the world looks, the platform’s features will allow us to incorporate physics-based interactions, so mobile agents can manipulate 3D objects. Habitat-Sim’s features effectively future-proof its scale and scope, increasing its relevance — and appeal — as a unified platform.
But the full impact of AI Habitat will depend less on our upgrades than on its adoption. As more researchers use the platform, the community will surprise us, providing access to more examples of promising agents, tasks, and strategies, along with even more realistic active environments. We hope this platform will establish research benchmarks that are as universal for embodied tasks as ImageNet, COCO, and VQA are for image recognition systems, introducing the kind of scientific rigor and reproducibility that will make embodied AI-powered assistants not only viable but inevitable.
Beyond being a fundamental scientific goal in AI, even a small advance toward such intelligent systems could fundamentally enhance our lives. From self-driving cars that can grasp the nuances of natural language commands to domestic robots that adapt to new homes and personalized tasks without being retrained, embodied AI could lead to systems with a humanlike range of skills. The path to those capabilities is long, but open resources such as Replica and AI Habitat have the potential to get all of us there faster.
AI Habitat is a large team effort. We would like to acknowledge Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jerry He, Angel Chang, Jiazhi Zhang, Jia Liu, Vladlen Koltun, Jitendra Malik, and Devi Parikh for their contributions, as well as the FRL engineers, researchers, and others working on Replica. We thank the Gibson Virtual Environment team for preparing and hosting their dataset for the Habitat Challenge, and Dmytro Mishkin and Alexey Dosovitskiy for open-sourcing their SLAM baseline.