Introducing droidlet, a one-stop shop for modularly building intelligent agents

July 28, 2021

The annals of science fiction are brimming with robots that perform tasks independently in the world, communicate fluently with people using natural language, and even improve themselves through these interactions. These machines do much more than follow preprogrammed instructions; they understand and engage with the real world much as people do.

Robots today can be programmed to vacuum the floor or perform a preset dance, but the gulf is vast between these machines and ones like Wall-E or R2-D2. This is largely because today’s robots don’t understand the world around them at a deep level. They can be programmed to back up when bumping into a chair, but they can’t recognize what a chair is or know that bumping into a spilled soda can will only make a bigger mess.

To help researchers and even hobbyists build more intelligent real-world robots, we’ve created and open-sourced the droidlet platform.

Droidlet is a modular, heterogeneous embodied agent architecture, and a platform for building embodied agents, that sits at the intersection of natural language processing, computer vision, and robotics. It simplifies integrating a wide range of state-of-the-art machine learning (ML) algorithms in embodied systems and robotics to facilitate rapid prototyping.

People using droidlet can quickly test out different computer vision algorithms with their robot, for example, or replace one natural language understanding model with another. Droidlet enables researchers to easily build agents that can accomplish complex tasks either in the real world or in simulated environments like Minecraft or Habitat.

There is much more work to do — both in AI and in hardware engineering — before we will have robots that are even close to what we imagine in books, movies, and TV shows. But with droidlet, robotics researchers can now take advantage of the significant recent progress across the field of AI and build machines that can effectively respond to complex spoken commands like “pick up the blue tube next to the fuzzy chair that Bob is sitting in.” We look forward to seeing how the research community uses droidlet to advance this important field.

A family of agents

Rather than considering an agent as a monolith, we consider the droidlet agent to be made up of a collection of components, some of which are heuristic and some learned. As more researchers build with droidlet, they will improve its existing components and add new ones, which others in turn can then add to their own robotics projects. We believe this heterogeneous design makes scaling tractable because each component can be trained on large datasets when such data is available for it, while programmers can still use sophisticated heuristics where those are available. The components can be trained with static data when convenient (e.g., a collection of labeled images for a vision component) or with dynamic data when appropriate (e.g., a grasping subroutine).

At a high level, the agent design consists of the following modules and the interfaces between them:

A memory system acting as a nexus of information for all agent modules
A set of perceptual modules (e.g., object detection or pose estimation) that process information from the outside world and store it in memory
A set of lower-level tasks, such as “move three feet forward” and “place item in hand at given coordinates,” that can effect changes in the agent’s environment
A controller that decides which tasks to execute based on the state of the memory system

Each of these modules can be further broken down into trainable or heuristic components.
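The four pieces listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not droidlet's actual API: the names `Memory`, `ChairDetector`, `move_to_chair`, `Controller`, and `run_demo` are hypothetical, the observation format is invented, and the "detector" is a heuristic stand-in for a learned model.

```python
from typing import Callable, Dict, List


class Memory:
    """Shared store that every module reads from and writes to (hypothetical)."""

    def __init__(self) -> None:
        self.facts: Dict[str, object] = {}


class ChairDetector:
    """Perceptual module: turns a raw observation into a fact in memory.

    A heuristic stand-in for a learned detector; the observation format is assumed.
    """

    def perceive(self, observation: dict, memory: Memory) -> None:
        if "chair" in observation.get("labels", []):
            memory.facts["chair_position"] = observation["chair_position"]


def move_to_chair(memory: Memory, actions: List[str]) -> None:
    """Lower-level task: acts on the environment (here it only records the action)."""
    actions.append(f"move_to {memory.facts['chair_position']}")
    memory.facts["at_chair"] = True


class Controller:
    """Decides which task to run by looking only at the state of memory."""

    def __init__(self, tasks: Dict[str, Callable[[Memory, List[str]], None]]) -> None:
        self.tasks = tasks

    def step(self, memory: Memory, actions: List[str]) -> None:
        if "chair_position" in memory.facts and not memory.facts.get("at_chair"):
            self.tasks["move_to_chair"](memory, actions)


def run_demo() -> List[str]:
    memory = Memory()
    actions: List[str] = []
    perception = [ChairDetector()]
    controller = Controller({"move_to_chair": move_to_chair})
    observation = {"labels": ["chair"], "chair_position": (1.0, 2.0)}
    for _ in range(3):  # on each tick: perceive first, then let the controller act
        for module in perception:
            module.perceive(observation, memory)
        controller.step(memory, actions)
    return actions
```

Calling `run_demo()` issues a single move action even though the loop runs three times, because the controller consults memory and sees that the agent has already arrived. Routing every decision through memory in this way is what lets a perceptual module or a task be replaced without touching the controller.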

This architecture also enables researchers to use the same intelligent agent on different robotic hardware by swapping out the tasks and the perceptual modules as needed by each robot's physical architecture and sensor requirements.
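One way to picture this swap is to keep the agent logic fixed and hide each robot behind a common interface. The sketch below assumes a hypothetical `MoveBackend` protocol and two made-up backends, `SimulatedRobot` and `WheeledRobot`; none of these names, nor the wheel circumference, come from droidlet.

```python
from typing import List, Protocol, Tuple


class MoveBackend(Protocol):
    """Hypothetical interface that each robot-specific task layer implements."""

    def move_forward(self, meters: float) -> None: ...


class SimulatedRobot:
    """Backend for a simulator: tracks distance travelled along one axis."""

    def __init__(self) -> None:
        self.position = 0.0

    def move_forward(self, meters: float) -> None:
        self.position += meters


class WheeledRobot:
    """Backend for imagined hardware: records the motor commands it would send."""

    WHEEL_CIRCUMFERENCE_M = 0.5  # assumed value, for illustration only

    def __init__(self) -> None:
        self.commands: List[Tuple[str, float]] = []

    def move_forward(self, meters: float) -> None:
        rotations = meters / self.WHEEL_CIRCUMFERENCE_M
        self.commands.append(("rotate_wheels", rotations))


def patrol(robot: MoveBackend) -> None:
    """Agent logic that is unaware of which backend it is driving."""
    for _ in range(2):
        robot.move_forward(1.0)
```

The same `patrol` function drives either backend unchanged: only the class passed in differs, which mirrors the idea of swapping tasks and perceptual modules per robot while leaving the rest of the agent alone.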

Substantially reducing friction in integrating ML models

The agent described above demonstrates how to build with droidlet using specified components, but this is not the only way to use the library. The droidlet platform supports researchers building embodied agents more generally by reducing friction in integrating ML models and new capabilities, whether scripted or learned, into their systems, and by providing UX for human-agent interaction and data annotation.

The modules can all be used independent of the main agent, and the state-of-the-art perceptual modules may be of particular value to other researchers, given that current off-the-shelf models are poor for robotic use cases. In addition to the wrappers for connecting ML models to robots, we have model zoos for the various modules, including several vision models fine-tuned for the robot setting (for RGB and RGBD cameras).

Droidlet is bolstered by an interactive dashboard that researchers can use as an operational interface when building agents. It includes debugging and visualization tools, as well as an interface for correcting agent errors on the fly or for crowdsourced annotation. As with the rest of the agent, the dashboard prioritizes modularity and makes it easy for researchers or hobbyists to add new widgets and tools.

A powerful and flexible platform

For researchers or hobbyists, droidlet offers batteries-included agents that come with primitives for visual perception and language, as well as a heuristic memory system and controller. Droidlet users can incorporate these modules into their robots or simulated agents by writing tasks that wrap primitives like “move to coordinate x, y, z.” These agents can perceive their environment via the provided pretrained object detection and pose estimation models and store their observations in the robot’s memory. Using this representation of the world around them, they can respond to language commands (e.g., “go to the red chair”), leveraging a pretrained neural semantic parser that converts natural language to programs.
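The flow from a language command to a wrapped movement primitive can be sketched as follows. This is a toy illustration, not droidlet's code: a regular expression stands in for the pretrained neural semantic parser, a plain dictionary stands in for the agent's memory, and the names `parse_command`, `resolve_target`, `move_to`, and `execute`, as well as the program format, are all hypothetical.

```python
import re
from typing import Dict, Optional, Tuple

Coordinate = Tuple[float, float, float]
# Toy memory: maps (color, object name) to a position the agent has observed.
ObjectMemory = Dict[Tuple[str, str], Coordinate]


def parse_command(text: str) -> Optional[dict]:
    """Rule-based stand-in for a semantic parser: text in, small program out."""
    match = re.fullmatch(r"go to the (\w+) (\w+)", text.strip().lower())
    if match is None:
        return None
    color, name = match.groups()
    return {"action": "MOVE", "target": {"color": color, "name": name}}


def resolve_target(program: dict, memory: ObjectMemory) -> Optional[Coordinate]:
    """Look up the object referenced by the program in the agent's memory."""
    target = program["target"]
    return memory.get((target["color"], target["name"]))


def move_to(x: float, y: float, z: float) -> str:
    """Task wrapping a 'move to coordinate x, y, z' primitive (only reports it here)."""
    return f"moving to ({x}, {y}, {z})"


def execute(text: str, memory: ObjectMemory) -> str:
    program = parse_command(text)
    if program is None:
        return "could not parse command"
    target = resolve_target(program, memory)
    if target is None:
        return "target not found in memory"
    return move_to(*target)
```

With a memory of `{("red", "chair"): (1.0, 0.0, 2.5)}`, the command "go to the red chair" resolves to that coordinate and is handed to `move_to`. A command that does not match the pattern, or an object that was never observed, falls through to an explicit failure message rather than a guess.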

Droidlet is also a flexible platform, and the ML modules and dashboards can be used outside the full agent. Over time, droidlet will become even more powerful as we add more tasks, sensory modalities, and hardware setups, and as other researchers and hobbyists build and contribute their own models.

Building intelligent machines that work in the real world is a fundamental scientific goal in AI. Facebook AI is helping the community by releasing not only droidlet and Habitat, but also other, independent research projects such as DD-PPO, our advanced point-goal navigation algorithm; SoundSpaces, our audio-visual platform for embodied AI; and our simple PyRobot framework. The path is long to building robots with capabilities that approach those of people, but we believe that by sharing our research work with the AI community, all of us will get there faster.
