Source: Deep Learning on Medium
Facebook Builds an AI Agent to Master the Hanabi Challenge: The New Frontier for AI Research
Hanabi is a card game that combines cooperation between players with imperfect information.
Earlier this year, researchers from DeepMind and Google published a paper proposing the game of Hanabi as the new frontier for artificial intelligence (AI) agents. The reason for the designation is that Hanabi combines many of the most difficult challenges for AI models in a single game. The card game operates in an environment of imperfect information and requires collaboration from different players in order to master a task. A few days ago, Facebook AI Research (FAIR) published a paper proposing an AI agent that achieves state-of-the-art performance in Hanabi while also unveiling new ideas for operating in cooperative environments with imperfect information.
Reading this, you might be wondering whether we need AI to master yet another game. After all, in recent years we have witnessed AI agents achieve superhuman performance in all sorts of games that involve complex strategic analysis, like Go; incomplete information, like Poker; or multi-agent competition, like StarCraft. However, in most of those games, agents were competing against other agents or humans, or simply collaborating with a known team of agents. The challenge of Hanabi is that it requires agents to coordinate with others in a partially observable environment with limited communication. As humans, we are constantly confronted with these types of situations, and we typically address them by formulating a mental model of how other agents will behave in different situations. This is typically known as the theory of mind. To solve Hanabi, AI agents need to develop communication mechanisms that allow them to cooperate in order to achieve a shared goal. But let’s start by understanding the dynamics of the Hanabi game.
Hanabi is a card game created by French game designer Antoine Bauza. The game can be seen as a form of “cooperative solitaire” and typically involves two to five players. Each player holds a hand of four cards (or five, when playing with two or three players). Each card depicts a rank (1 to 5) and a color (red, green, blue, yellow, and white); the deck (the set of all cards) is composed of 50 cards in total, 10 of each color: three 1s; two each of 2s, 3s, and 4s; and a single 5. The goal of the game is to play cards so as to form five consecutively ordered stacks, one for each color, beginning with a card of rank 1 and ending with a card of rank 5. What makes Hanabi special is that, unlike in most card games, players can only see their partners’ hands, and not their own.
At any given turn, a Hanabi player can perform one of three actions: giving a hint, playing a card from their hand, or discarding a card.
I. Hints: The active player can give a hint to any other player, at the cost of one of the group’s eight information tokens. A hint consists of choosing a rank or color and indicating to another player all of the cards in their hand that match the given rank or color.
II. Discard: Whenever fewer than eight information tokens remain, the active player can discard a card from their hand, recovering one information token. The discarded card is placed face up (along with any unsuccessfully played cards), visible to all players.
III. Play: Finally, the active player may pick a card (known or unknown) from their hand and attempt to play it. Playing a card is successful if the card is the next in the sequence of its color to be played; if it is not, the group loses one of its three fuse tokens and the card joins the discards.
Players lose immediately if all fuse tokens are gone, and win immediately if all 5s have been played successfully. Otherwise, play continues until the deck becomes empty, and for one full round after that. At the end of the game, the values of the highest cards in each color stack are summed, resulting in a total score out of a possible 25 points.
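The deck composition and scoring rule above are easy to make concrete. The following is a minimal sketch (not from the FAIR paper) that builds the 50-card deck and computes the end-of-game score from the highest card played in each color stack:

```python
COLORS = ["red", "green", "blue", "yellow", "white"]
# Per color: three 1s, two each of 2s, 3s, and 4s, and a single 5 (10 cards).
RANK_COUNTS = {1: 3, 2: 2, 3: 2, 4: 2, 5: 1}

def build_deck():
    """Return the full 50-card Hanabi deck as (color, rank) tuples."""
    return [(c, r) for c in COLORS for r, n in RANK_COUNTS.items() for _ in range(n)]

def score(fireworks):
    """Sum the highest successfully played rank per color stack (max 25)."""
    return sum(fireworks.get(c, 0) for c in COLORS)

deck = build_deck()
assert len(deck) == 50
# A finished game with every stack completed scores the maximum:
assert score({c: 5 for c in COLORS}) == 25
```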
SPARTA: Solving Hanabi Using Search Strategies
Hanabi combines cooperative gameplay and imperfect information in a multiplayer setting. Effective Hanabi players, whether human or machine, must try to understand the beliefs and intentions of other players, because they can’t see the same cards their teammates see and can only share very limited hints with each other. This is an ideal setting for search algorithms.
Based on that assumption, the FAIR team proposed a strategy known as Search for Partially Observing Teams of Agents (SPARTA). The SPARTA model follows the same principles used to build Pluribus, the famous agent that mastered six-player, no-limit Texas Hold’em. SPARTA also uses a precomputed full-game strategy, but only as a “blueprint” to roughly approximate what will happen later in the game after various actions are taken. It then uses this information to compute an improved strategy in real time for the specific situation it is in.
Conceptually, SPARTA consists of two fundamental methods: single-agent search and multi-agent search. In single-agent search, one agent performs search while its teammates follow a fixed blueprint policy; in multi-agent search, multiple agents can perform search simultaneously, but each must simulate the search procedure of the other agents in order to understand why they took the actions they did.
Single Agent Search
In the single-agent search model, a single agent performs search assuming all other agents play according to the blueprint policy. This allows the search agent to treat the known policy of other agents as part of the environment and maintain beliefs about the hidden information based on others’ actions.
Since the searcher is the only agent determining its strategy in real time while all other agents play a fixed common-knowledge strategy, this is effectively a single-agent setting for the searcher (also known as a partially observable Markov decision process). The searcher maintains a probability distribution over hands it may be holding. Whenever another agent acts, the searcher loops through each hand it could be holding and updates its belief about whether it is in fact holding that hand, based on whether the other agent would have taken the observed action according to the blueprint strategy if the searcher were holding that hand. Each time the searcher must act, it estimates via Monte Carlo rollouts the expected value of each action given the probability distribution over hands. In doing this, the searcher assumes all agents (including the searcher) play according to the blueprint strategy for the remainder of the game.
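The two mechanisms described above, filtering beliefs over one's own hand against teammates' blueprint actions and evaluating actions via Monte Carlo rollouts, can be sketched as follows. This is a simplified illustration, not FAIR's implementation: the `blueprint` and `rollout` callables are hypothetical stand-ins for the precomputed policy and the game simulator.

```python
import random

def update_beliefs(beliefs, observed_action, blueprint, game_state):
    """Keep probability mass only on hands for which the teammate's
    blueprint policy would have produced the observed action, then renormalize."""
    new_beliefs = {}
    for hand, prob in beliefs.items():
        # blueprint(game_state, hand) -> action the teammate would take if the
        # searcher were holding `hand` (hypothetical interface).
        if blueprint(game_state, hand) == observed_action:
            new_beliefs[hand] = prob
    total = sum(new_beliefs.values())
    return {h: p / total for h, p in new_beliefs.items()} if total else beliefs

def expected_values(beliefs, legal_actions, rollout, game_state, n_samples=100):
    """Estimate each action's value via Monte Carlo rollouts: sample a hand from
    the belief distribution, then assume everyone (including the searcher)
    plays the blueprint for the rest of the game."""
    hands, probs = zip(*beliefs.items())
    values = {}
    for action in legal_actions:
        total = 0.0
        for _ in range(n_samples):
            hand = random.choices(hands, weights=probs)[0]
            # rollout(...) plays the game to completion under the blueprint
            # and returns the final score (hypothetical helper).
            total += rollout(game_state, hand, action)
        values[action] = total / n_samples
    return values
```

The searcher then simply picks the action with the highest estimated expected value; this is what lets it outperform the blueprint it was built from.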
Single-agent search improves overall performance because it allows the search-enhanced player to make better decisions. However, this method has the limitation that the searcher’s teammates still use only the blueprint strategy, and are therefore sometimes still acting sub-optimally.
Multi Agent Search
The single-agent search model assumes that all agents agree beforehand on a blueprint policy, and also agree that only one agent will ever conduct search and deviate from the blueprint. This model has some marked limitations. For instance, if Bob conducts search on the second turn after Alice conducted search, then Bob’s probability distribution over hands is incorrect, because it assumes Alice played according to the blueprint strategy, while Alice actually played the modified strategy determined via search.
Multi-agent search addresses the main limitation of single-agent search by allowing multiple players to correctly conduct search in the same game. The key idea is that agents replicate the search procedures of teammates who acted, in order to see what strategies those search procedures produced. The multi-agent search model assumes that all agents agree beforehand on both a blueprint policy and the search procedure that will be used. When an agent acts after conducting search, the other agents exactly replicate that agent’s search procedure and compute its resulting policy accordingly.
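The replication idea above works because the agreed-upon search procedure is deterministic given common-knowledge inputs, so every agent that re-runs it recovers the same modified policy. The toy sketch below illustrates only that property (the "search" here is a made-up improvement over dummy blueprint values, not SPARTA's actual procedure):

```python
def shared_search(blueprint_values, state_seed):
    """Deterministic stand-in for the agreed-upon search procedure: any agent
    running it with the same common-knowledge inputs recovers the same
    improved policy (here, a dict of action values)."""
    improved = dict(blueprint_values)
    # "Search" bumps the value of the best blueprint action by an amount that
    # depends only on common knowledge (toy improvement, for illustration).
    best = max(improved, key=improved.get)
    improved[best] += state_seed % 3
    return improved

# When one agent acts after searching, each teammate independently replicates
# the search. Because the inputs are common knowledge, all replicas agree, so
# everyone can update beliefs against the searched policy, not the stale blueprint.
blueprint = {"play": 5.0, "discard": 4.0}
replicas = [shared_search(blueprint, state_seed=7) for _ in range(3)]
assert all(r == replicas[0] for r in replicas)
```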
SPARTA and Hanabi
Because the game is so new, there are not many established benchmarks for evaluating performance on Hanabi. Discussions with highly experienced human players suggest that, when optimizing for perfect scores, top players might achieve them in 2-player Hanabi somewhere in the range of 60% to 70% of the time. The SPARTA agent optimizes for expected value rather than perfect scores, and still achieves perfect scores 75.5% of the time in 2-player Hanabi, as shown in the following figure.
Hanabi provides a unique environment that requires collaboration among agents under imperfect information. Some of the ideas that the FAIR team pioneered with SPARTA could be immediately applicable to many real-world scenarios. In addition to the paper, FAIR open-sourced the code for the strategy on GitHub.