December 06, 2019
We’ve built an AI bot that achieves state-of-the-art results in Hanabi, a collaborative card game that has been cited as a benchmark game for AI research because it features both cooperative gameplay and imperfect information.
Our bot outperforms previous AI algorithms at Hanabi by using real-time search to fine-tune its decisions during gameplay. It’s the first bot to exceed elite human performance in the game, as judged by experienced players who have evaluated it.
Researchers have found it challenging to apply search beyond perfect-information games like chess and Go. Our success with Hanabi suggests search can improve more AI systems and eventually help build AI that learns to master complex cooperative tasks in real-world settings.
Artificial intelligence bots have achieved superhuman results in zero-sum games such as chess, Go, and poker, in which each player tries to defeat the others. However, just like humans, real-world AI systems must learn to coordinate in cooperative environments as well.
To advance research on AI that can understand other points of view and collaborate effectively, Facebook AI has developed a bot that sets a new state of the art in Hanabi, a card game in which all players work together. Our bot not only improves upon previous AI systems but also exceeds the performance of elite human players, as judged by experienced players who have evaluated it.
Earlier this year, AI researchers at DeepMind and Google Brain proposed that Hanabi be a new frontier for AI research because it combines cooperative gameplay and imperfect information in a multiplayer setting. Effective Hanabi players, whether human or machine, must try to understand the beliefs and intentions of other players, because they can’t see the same cards their teammates see and can only share very limited hints with each other. Developing this “theory of mind” allows players to understand their teammates’ actions and predict how they will respond to their own decisions.
Our bot achieves this by implementing a new real-time search method that is similar to the depth-limited search technique used in Pluribus, the first bot to defeat elite professionals in six-player no-limit Texas Hold’em poker. This search method works in conjunction with a precomputed strategy, allowing the bot to fine-tune its actions for each situation it encounters during gameplay. Our search technique can be used to significantly improve any Hanabi strategy, including deep reinforcement learning (RL) algorithms that set the previous state of the art.
Search methods used for perfect-information games such as backgammon, chess, and Go don’t work in environments like Hanabi, because each player can see some but not all of the cards in play. With only imperfect information about the current state of the game, the bot must account for many different possible “world states” when deciding on its best action. Performing search in these circumstances is more complex and typically requires much more computational resources.
Although much of multi-agent AI research has focused on the zero-sum competitive setting, research into cooperative AI techniques that allow for effective coordination using limited communication has far more potential real-world uses. For example, navigating traffic requires clearly signaling intent and interpreting the intention of others in partially observable environments, and AI systems deployed in workplaces need to be able to interpret nonverbal cues that humans provide and understand their meaning in the context of the state of the world. Likewise, conversational agents will be more effective if they understand the beliefs and intentions of the people they’re interacting with. We’re excited that the results we have achieved in Hanabi may move us closer to deploying helpful, cooperative AI agents in the real world.
Hanabi is a fully cooperative card game that is sometimes described as a form of “team solitaire.” The game is partially observable and, as a key twist, each player can see the hands of each of her teammates, but cannot observe the cards she herself is holding.
As a consequence, players must exchange information in order to understand which cards to play or discard, but they can communicate only through very limited hints about certain cards’ color or number. Human players overcome these challenges by observing teammates’ actions and interpreting their beliefs and intentions. A player might reason, for example, that a teammate’s hint about a certain card’s color means she wants him to play that card on his next turn. Players also leverage their common knowledge about the game: For example, all the players know that Hanabi is a collaborative game, so they can conclude a teammate’s hint isn’t meant to trick or mislead.
This ability to deduce the thinking behind someone’s behavior is referred to as theory of mind and is the key aspect that makes Hanabi both fascinating for humans to play and for AI researchers to study. Effective Hanabi players, whether human or AI, must develop a theory of mind to understand their teammates’ beliefs and intentions and predict how they will interpret and respond to their own actions.
Previous work on Hanabi strategies have used variants of RL to produce a strategy the bot follows throughout the entire course of the game. Since the strategy must define an action for every possible situation in the game, even the best RL-based strategies can only roughly approximate an optimal strategy for any particular situation and sometimes choose clearly suboptimal actions. Search offers a way to address this shortcoming.
Our search technique still uses a precomputed full-game strategy but only as a “blueprint” to roughly approximate what will happen later in the game after various actions are taken. It then uses this information to compute an improved strategy in real time for the specific situation it is in.
In order for an agent to correctly determine the best next move, it must know the probability distribution over world states it might be in (that is, the probability distribution over the hidden cards in its hand). Correctly determining this distribution is the most difficult aspect of implementing search in Hanabi. We have developed two approaches to deal with this challenge.
The simplest way to incorporate search in a cooperative, partially observable setting such as Hanabi is to have just one agent conduct search, while all the other players simply play according to the blueprint strategy. We assume in this case that the blueprint strategy — and the fact that all but the searcher plays according to the blueprint strategy — is agreed upon before play begins and is therefore common knowledge among all the agents. Since Hanabi is a fully cooperative game, it is completely safe to make this assumption; because nobody would want to intentionally deceive another player (unlike in poker, for example).
Since the searcher is the only agent determining its strategy in real time while all other agents play a fixed common-knowledge strategy, this is effectively a single-agent setting for the searcher (also known as a partially observable Markov decision process). The searcher maintains a probability distribution over hands it may be holding. Whenever another agent acts, the searcher loops through each hand it could be holding and updates its belief about whether it is in fact holding that hand, based on whether the other agent would have taken the observed action according to the blueprint strategy if the searcher were holding that hand. Each time the searcher must act, it estimates via Monte Carlo rollouts the expected value of each action given the probability distribution over hands. In doing this, the searcher assumes all agents (including the searcher) play according to the blueprint strategy for the remainder of the game.
In this scenario, the searcher calculates the expected value of only the next action. This is commonly referred to as one-ply search. One could achieve even better performance by searching further ahead. However, the computational cost of the search would also increase, especially in a game like Hanabi, which has a large branching factor due to chance. While we would expect stronger performance from deeper search, we have found that even one-ply search is enough to exceed elite human performance.
Single-agent search improves overall performance because it allows the search-enhanced player to make better decisions. But the searcher’s teammates are still using just the blueprint strategy in this scenario and therefore are sometimes still acting suboptimally. To achieve greater performance gains, we developed a way for multiple bots to use search simultaneously while preserving their ability to interpret other players’ actions correctly.
Our search procedure uses exact knowledge of all other agents’ strategies, so the technique described so far cannot be conducted correctly by multiple agents independently. For example, if Bob conducts search on the second turn after Alice conducted search, then Bob’s belief about his probability distribution over hands is incorrect. This is because it assumes Alice played according to the blueprint strategy, while Alice actually played the modified strategy determined via search.
To address this, we developed a more general (and also more complicated) search algorithm we call multi-agent search, which makes it possible for multiple players to correctly conduct search in the same game. The key idea is that agents replicate the search procedures of teammates who acted to see what strategies their search procedures produced.
To illustrate multi-agent search, let’s consider a game between two players, Alice and Bob:
On the first turn, Alice knows from the rules of the game the exact probability distribution over hands she might be holding. So, just as in single-agent search, she can do Monte Carlo rollouts to determine the expected value of each action and choose the highest-value action. This results in Alice’s choosing action A.
Now, before performing search, Bob must determine the probability that he is holding each possible hand. To do that, he needs to know which hands he could be holding that would have resulted in Alice’s choosing action A. Bob does this by looping through each hand he might be holding and seeing what action Alice’s search algorithm on the first turn would have chosen in that case (we call this range search). Once he knows his probability distribution over his possible hands, he can perform search, leading him to choose action B.
On the third turn, Alice needs to determine which hands she could be holding that would have resulted in Bob’s choosing action B. So Alice needs to loop through each of her possible hands and see whether Bob’s search algorithm from the previous turn would have resulted in action B in that case.
But in order for Alice to replicate Bob’s search, she needs to know which cards Bob thinks he might be holding, meaning that she has to replicate Bob’s range search for her action A on the first turn, even though she already knows exactly what cards Bob is holding!
This is a key insight for multi-agent search: Each agent needs to calculate how it would behave even in situations it knows it is not in. For example, Bob may believe it is likely he is holding a red five in his hand. This will affect his decisions, so Alice must account for that possibility even if she can see that Bob doesn’t in fact hold a red five.
Multi-agent search is much more computationally expensive than single-agent search, and this cost increases with the number of possible world states. Fortunately, it is possible to use multi-agent search only when it is computationally feasible and revert to single-agent search otherwise. We do this by having all agents agree beforehand on a “budget” for search. If replicating an agent’s search procedure would exceed this budget, then the agent simply plays according to the blueprint until replicating search would fall under the budget threshold. Since the cost of doing search in any world state is common knowledge, all agents know when an agent is playing according to the blueprint rather than doing search.
The scale of the improvement we observed due to search was far larger than anything we expected. The current state of the art for deep RL in Hanabi is the Simplified Action Decoder developed recently at Facebook AI Research, which achieves an average scores of 24.08 in two-player Hanabi. Adding just single-agent search to a handcrafted heuristic bot that normally achieves 22.99 boosts its score to 24.21, exceeding the performance of every deep RL bot that has been developed. Our strongest agent for two-player Hanabi achieved an average score of 24.61 by adding multi-agent search to SAD. We’ve also found that single-agent search greatly boosts performance in Hanabi with more than two players as well for every blueprint we tested, though conducting multi-agent search would be much more expensive with more players since each player observes more cards.
All our results are for the self-play setting, in which all agents use the same blueprint strategy and search algorithm. However, in informal testing we also found that search makes bots more robust in unexpected off-policy situations, such as when playing with humans. Being able to coordinate with unfamiliar agents following foreign conventions is an important problem for AI research, and one that we hope to explore further.
In the field of AI and games, most previous breakthroughs have occurred in purely adversarial zero-sum environments. But when it comes to deploying AI systems in the real world, we believe cooperation is far more valuable than competition. The success of our search technique, added on top of a strong blueprint strategy for Hanabi, suggests that similar approaches might be useful in other real-world cooperative settings, such as navigating traffic and conversational agents.
Our search technique relies very little on the specific structure of Hanabi, so it can be applied to other situations. As we look to leverage search in new settings, we are particularly excited about two aspects of our technique:
Search is entirely orthogonal to and compatible with advances happening in deep RL. We’ve observed that adding search on top of any strategy in Hanabi produces an even stronger agent.
Leveraging compute at inference time in addition to training time has been an efficient way to improve performance. We’ve found that conducting search at inference time with even a single CPU core on top of a handcrafted blueprint strategy leads to greater performance than any existing deep RL strategy alone.
Since Hanabi has at most 10 million possible hands, it is possible for a player to enumerate all the possible game states consistent with the observable information. More work is needed to make search efficient enough to use in settings where that is not possible. Our search method also does not learn and adapt over the course of multiple games. Adding this ability could make the bot more effective when playing with human teammates, since it could learn their tendencies over time.
Overall, our results with Hanabi reinforce and extend a pattern seen in a number of games, including backgammon, chess, Go, and poker: Adding search to RL can dramatically improve performance beyond what can be achieved through RL alone. We have now shown that this approach can work in cooperative environments as well. We look forward to extending this work into new applications, including potentially those that are neither fully cooperative nor fully competitive, and those that involve natural-language communication with humans.