Research

Teaching AI to be more collaborative with humans without learning directly from them

April 18, 2022

What the research is:

It’s often not enough for robots to excel at a given task. They must also behave in a way that people can easily understand, and they must be able to anticipate how others will respond to their actions. This has been a difficult AI challenge, however. With the most widely used approach — reinforcement learning (RL), where agents learn mainly from rewards collected during interactions with the environment — the agent typically develops its own unique behaviors and communication protocols. It might arbitrarily decide, for example, that the letter A represents blue pieces or that it’s best not to play a green card after playing two red cards. These conventions can be unintelligible both to humans and to other agents trained independently.

This isn’t an issue with competitive games, such as Go or chess. But intelligibility is essential in cooperative games, such as bridge or gin rummy, where two partners must work together and know how best to help each other even with very limited information about each other’s cards. Previously, agents trained with RL and without human-labeled data unavoidably picked up arbitrary conventions, making them unsuitable for real-world human-AI cooperation.

Meta AI has developed a new, more flexible approach to teaching AI agents to cooperate and make their actions understandable to people: off-belief learning. Instead of using human-labeled data, off-belief learning starts by searching for “grounded communication,” where the goal is to find the most efficient way to communicate without assuming any prior conventions. We are sharing a paper on our work, open-sourcing the code, and releasing a public demo where anyone can play with our model trained using off-belief learning.

How it works:

Let’s consider a simple cooperative card game between two strangers, Alice and Bob. Alice draws a random card that’s either red or blue. While she’s able to see each card she draws, Bob can’t and must guess the color of the card. Each time he chooses the right color, red or blue, both players win $5. Alice can either send the number 1 or 2 to Bob, or simply reveal the card to him. Sending a number is free, but it costs $1 to reveal the card.
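To make the setup concrete, here is a minimal sketch of the game in Python. The dollar values come from the description above; the names (play_round, alice_policy, bob_policy) are purely illustrative and not from our released code.

```python
import random

# Payoffs from the description above: a correct guess is worth $5 to both
# players, sending a number is free, and revealing the card costs $1.
CARD_COLORS = ("red", "blue")
WIN_REWARD = 5
REVEAL_COST = 1

def play_round(alice_policy, bob_policy):
    """Play one round and return the joint payoff in dollars."""
    card = random.choice(CARD_COLORS)                      # only Alice sees the card
    action = alice_policy(card)                            # "send_1", "send_2", or "reveal"
    observation = card if action == "reveal" else action   # Bob sees the color only if Alice pays to reveal
    guess = bob_policy(observation)                        # Bob guesses "red" or "blue"
    payoff = WIN_REWARD if guess == card else 0
    return payoff - (REVEAL_COST if action == "reveal" else 0)
```

For instance, an Alice who always reveals, paired with a Bob who simply repeats the color he sees, earns $4 per round.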

In this game, the grounded communication is to avoid the cheaper options and always just show Bob the true color. This is the same approach humans would use when communicating with strangers with whom they haven’t established a prior rapport. With conventional multiagent RL methods, the two agents would converge on using 1 to represent one color and 2 for the other. But that approach wouldn’t work well when playing with humans or with another independently trained agent: since there’s no way to know in advance which convention an agent adopted, a new partner wouldn’t know whether 1 represented blue or red.

This is where off-belief learning comes in. The goal of off-belief learning is to find the most efficient way to communicate without assuming any prior conventions. This grounded policy can then be used as the basis for more advanced policies. Key to this approach is fixing a common-knowledge policy that each agent can always assume the other agents are operating under, even though the actual policies used by the players can be drastically different. If we pick the uniform random policy as our common-knowledge policy — one that samples each action with equal probability, so there is no shared prior knowledge between the agents — both agents learn to behave as if there were no prior conventions at all.
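To illustrate, here is a rough sketch of the resulting belief, specialized to Alice and Bob’s card game and assuming the uniform random policy as the common-knowledge policy. The helper names (pi0, bob_belief) are hypothetical and not taken from our released implementation.

```python
# Bob interprets whatever he observes as if Alice's action had been drawn from
# the fixed common-knowledge policy (here, uniform random over her three
# actions), regardless of how Alice actually plays.
CARD_COLORS = ("red", "blue")
ALICE_ACTIONS = ("send_1", "send_2", "reveal")

def pi0(action, card):
    """Uniform random common-knowledge policy: every action is equally likely."""
    return 1.0 / len(ALICE_ACTIONS)

def bob_belief(observation):
    """Bob's posterior over the card color, assuming Alice followed pi0."""
    likelihood = {}
    for card in CARD_COLORS:
        if observation in ("send_1", "send_2"):
            likelihood[card] = pi0(observation, card)
        else:
            # Observation is a revealed color; it is only consistent with the matching card.
            likelihood[card] = pi0("reveal", card) if observation == card else 0.0
    total = sum(likelihood.values())
    return {card: p / total for card, p in likelihood.items()}

print(bob_belief("send_1"))  # {'red': 0.5, 'blue': 0.5}: numbers carry no information
print(bob_belief("red"))     # {'red': 1.0, 'blue': 0.0}: revealing is fully informative
```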

Let’s go back to the example of Alice and Bob’s card game. Suppose Alice initially has a convention of sending 1 to mean blue and 2 to mean red, but Bob always behaves as if Alice’s messages were sent by a randomly acting agent. Then Bob will guess the color consistently only when Alice reveals the true color, not when she sends 1 or 2. As a result, Alice eventually learns she can’t communicate any meaningful information by sending 1 or 2 and realizes she is better off paying $1 to reveal the true color.
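A quick back-of-the-envelope check, using only the payoffs of the game above, makes this incentive explicit:

```python
# Expected joint payoff of Alice's two options once Bob holds the beliefs above.
WIN_REWARD, REVEAL_COST = 5, 1

send_number = 0.5 * WIN_REWARD           # Bob guesses under a 50/50 belief: $2.50 on average
reveal_card = WIN_REWARD - REVEAL_COST   # Bob always guesses correctly: $4.00
print(send_number, reveal_card)          # 2.5 4
```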

Any joint policy that does not rely on prior conventions is a grounded policy; the best response to a uniform random policy is one example. The special power of off-belief learning is that it can find the optimal one by fixing the interpretation of the past while optimizing the current and future actions jointly for all players. In more complex scenarios where grounded play may not be the best solution, we can use the outcome of off-belief learning as the new common-knowledge policy and apply off-belief learning again. Each time we repeat this, agents can develop new, more sophisticated ways to communicate while sharing a common chain of reasoning that originates from the grounded policy.
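One way to picture this repetition is as a loop over levels of off-belief learning, each level treating the previous level’s policy as the new common-knowledge policy. The sketch below is only schematic; obl_best_response is a hypothetical placeholder for the full RL training step described in the paper.

```python
import random

def uniform_random_policy(observation, actions):
    """Level-0 common-knowledge policy: pick any legal action with equal probability."""
    return random.choice(actions)

def obl_levels(obl_best_response, num_levels):
    """Return the sequence of policies pi_0, pi_1, ..., pi_{num_levels}."""
    policies = [uniform_random_policy]  # pi_0: the grounded, convention-free starting point
    for _ in range(num_levels):
        # pi_{k+1} interprets past actions as if they came from pi_k, while
        # optimizing current and future actions jointly for all players.
        policies.append(obl_best_response(policies[-1]))
    return policies
```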

Why it matters:

AI already has many capabilities, but its applications will be limited if it acts in ways people would never expect. Off-belief learning will help address this. In many ways, this process resembles how people approach complex interpersonal communication problems. We may have to spend more time explaining and clarifying things in a new relationship, but over time, as we learn how to communicate with someone, we establish a shared language and shared reasoning that make our communication more effective.

In this latest research, we propose an efficient implementation of this algorithm and test it on Hanabi, a collaborative card game and a key benchmark for AI research that features both cooperative gameplay and imperfect information. We found that off-belief learning significantly improves how well an AI agent can collaborate with a human proxy policy (a policy that attempts to mimic the behavior of an actual person) without using human data. Many Hanabi players found it more intuitive to play with agents trained in this way than with agents from prior work. This method is a valuable next step toward building collaborative assistants that will one day be both ubiquitous and tremendously helpful in our lives.

Read the paper

Written By

Hengyuan Hu

Research Engineer

David Wu

Research Engineer

Jakob Foerster

Professor at University of Oxford and former Research Scientist at Meta