Superhuman AI for multiplayer poker

Noam Brown1,2* and Tuomas Sandholm1,3,4,5*

In recent years there have been great strides in artificial intelligence (AI), with games often serving as challenge problems, benchmarks, and milestones for progress. Poker has served for decades as such a challenge problem. Past successes in such benchmarks, including poker, have been limited to two-player games. However, poker in particular is traditionally played with more than two players. Multiplayer games present fundamental additional issues beyond those in two-player games, and multiplayer poker is a recognized AI milestone. In this paper we present Pluribus, an AI that we show is stronger than top human professionals in six-player no-limit Texas hold'em poker, the most popular form of poker played by humans.

In Rock-Paper-Scissors, playing the Nash equilibrium also guarantees that the player will not win in expectation. However, in more complex games, even determining how to tie against a Nash equilibrium may be difficult; if the opponent ever chooses suboptimal actions, then playing the Nash equilibrium will indeed result in victory in expectation.

In principle, playing the Nash equilibrium can be combined with opponent exploitation by initially playing the equilibrium strategy and then over time shifting to a strategy that exploits the opponent's observed weaknesses (for example, by switching to always playing Paper against an opponent that always plays Rock) (11). However, except in certain restricted ways (12), shifting to an exploitative nonequilibrium strategy opens oneself up to exploitation, because the opponent could also change strategies at any moment.
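To make this tradeoff concrete, the following sketch (our illustration, not part of the original article) computes expected payoffs in Rock-Paper-Scissors: the uniform equilibrium earns exactly zero in expectation against every opponent, while the exploitative always-Paper strategy wins against an always-Rock opponent but loses if the opponent counter-adapts to always-Scissors.

```python
import numpy as np

# Row player's payoff matrix for Rock-Paper-Scissors.
# Rows: our action (R, P, S); columns: opponent action (R, P, S).
PAYOFF = np.array([
    [ 0, -1,  1],   # Rock ties Rock, loses to Paper, beats Scissors
    [ 1,  0, -1],   # Paper beats Rock, ties Paper, loses to Scissors
    [-1,  1,  0],   # Scissors loses to Rock, beats Paper, ties Scissors
])

def expected_value(our_strategy, opp_strategy):
    """Expected payoff of our mixed strategy against the opponent's mixed strategy."""
    return our_strategy @ PAYOFF @ opp_strategy

uniform = np.array([1/3, 1/3, 1/3])          # the Nash equilibrium strategy
always_paper = np.array([0.0, 1.0, 0.0])     # exploitative strategy
always_rock = np.array([1.0, 0.0, 0.0])      # weak opponent
always_scissors = np.array([0.0, 0.0, 1.0])  # counter-adapted opponent

print(expected_value(uniform, always_rock))            # 0.0: equilibrium never wins in expectation
print(expected_value(always_paper, always_rock))       # 1.0: exploitation beats the weak opponent
print(expected_value(always_paper, always_scissors))   # -1.0: but is itself exploitable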
Finding a Nash equilibrium in zero-sum games with three or more players is at least as hard (because a dummy player can be added to the two-player game to make it a three-player zero-sum game). Even approximating a Nash equilibrium is hard (except in special cases) in theory (15), and in games with more than two players, even the best complete algorithm can only address games with a handful of possible strategies per player (16). Moreover, even if a Nash equilibrium could be computed efficiently in a game with more than two players, it is not clear that playing such an equilibrium strategy would be wise. If each player in such a game independently computes and plays a Nash equilibrium, the list of strategies that they play (one strategy per player) may not be a Nash equilibrium and players might have an incentive to deviate to a different strategy. One example of this is the Lemonade Stand Game (17), illustrated in Fig. 1, in which each player simultaneously picks a point on a ring and wants to be as far away as possible from any other player. The Nash equilibrium is for all players to be spaced uniformly along the ring, but there are infinitely many ways this can be accomplished and therefore infinitely many Nash equilibria; if each player independently computes one of those equilibria, the joint strategy is unlikely to result in all players being spaced uniformly along the ring.
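As a quick numerical illustration of this equilibrium-selection problem (a hypothetical sketch; the article itself only describes the game), the snippet below has each of three players independently compute a uniform-spacing equilibrium of a Lemonade-Stand-style ring game and play their own part of it. Because each equilibrium is only determined up to a rotation, the combined placements are almost never uniformly spaced.

```python
import random

N_PLAYERS = 3  # players on a ring of circumference 1

def sample_equilibrium():
    """One Nash equilibrium: players uniformly spaced, up to an arbitrary rotation."""
    rotation = random.random()
    return [(i / N_PLAYERS + rotation) % 1.0 for i in range(N_PLAYERS)]

# Each player independently computes an equilibrium and plays *their own* slot in it.
joint = [sample_equilibrium()[i] for i in range(N_PLAYERS)]

# Measure the gaps between adjacent players; uniform spacing would give all gaps = 1/3.
positions = sorted(joint)
gaps = [(positions[(i + 1) % N_PLAYERS] - positions[i]) % 1.0 for i in range(N_PLAYERS)]
print(gaps)  # almost surely not [1/3, 1/3, 1/3]: the joint strategy is not an equilibrium
```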
In six-player poker, we take the viewpoint that our goal should not be a specific game-theoretic solution concept but rather to create an AI that empirically consistently defeats human opponents, including elite human professionals.

The algorithms that we used to construct Pluribus, discussed in the next two sections, are not guaranteed to converge to a Nash equilibrium outside of two-player zero-sum games. Nevertheless, we observe that Pluribus plays a strong strategy in multiplayer poker that is capable of consistently defeating elite human professionals. This shows that even though the techniques do not have known strong theoretical guarantees on performance outside of the two-player zero-sum setting, they are nevertheless capable of producing superhuman strategies in a wider class of strategic settings.

Description of Pluribus

The core of Pluribus's strategy was computed through self-play, in which the AI plays against copies of itself, without any data of human or prior AI play used as input. Pluribus's self-play produces a strategy for the entire game offline, which we refer to as the blueprint strategy. Then during actual play against opponents, Pluribus improves upon the blueprint strategy by searching for a better strategy in real time for the situations in which it finds itself during the game. In subsections below, we discuss both of those phases in detail, but first we discuss abstraction, forms of which are used in both phases to make them scalable.

Abstraction for large imperfect-information games

There are far too many decision points in no-limit Texas hold'em to reason about individually. To reduce the complexity of the game, we eliminate some actions from consideration and also bucket similar decision points together in a process called abstraction (24, 25). After abstraction, the bucketed decision points are treated as identical. We use two kinds of abstraction in Pluribus: action abstraction and information abstraction.
Fig. 2. A game tree traversal via Monte Carlo CFR. In this figure, player P1 is traversing the game tree. (Left) A game is simulated until an outcome is reached. (Middle) For each P1 decision point encountered in the simulation in the left panel, P1 explores each other action that P1 could have taken and plays out a simulation to the end of the game. P1 then updates its strategy to pick actions with higher payoff with higher probability. (Right) P1 explores each other action that P1 could have taken at every new decision point encountered in the middle panel, and P1 updates its strategy at those hypothetical decision points. This process repeats until no new P1 decision points are encountered, which in this case is after three steps but in general may be more. Our implementation of MCCFR (described in the supplementary materials) is equivalent but traverses the game tree in a depth-first manner. (The percentages in the figure are for illustration purposes only and may not correspond to actual percentages that the algorithm would compute.)
When an opponent chooses a bet size that is not in the action abstraction, Pluribus will rely on its search algorithm (described in a later section) to compute a response in real time to such "off-tree" actions.

The other form of abstraction that we use in Pluribus is information abstraction, in which decision points that are similar in terms of what information has been revealed (in poker, the player's cards and revealed board cards) are bucketed together and treated identically (26–28). For example, a 10-high straight and a 9-high straight are distinct hands but are nevertheless strategically similar. Pluribus may bucket these hands together and treat them identically, thereby reducing the number of distinct situations for which it needs to determine a strategy. Information abstraction drastically reduces the complexity of the game, but it may wash away subtle differences that are important for superhuman performance. Therefore, during actual play against humans, Pluribus uses information abstraction only to reason about situations on future betting rounds, never the betting round it is actually in.
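The article does not give Pluribus's exact bucketing procedure at this point, but the idea of information abstraction can be sketched as follows: hands are quantized into a small number of buckets by a strength feature, and all hands in a bucket share one strategy entry. The hand names and strength scores below are illustrative placeholders, not Pluribus's actual features.

```python
# A minimal sketch of information abstraction: strategically similar hands are
# bucketed together and share a single strategy. Real systems cluster hands by
# richer features such as equity distributions; these scores are made up.
N_BUCKETS = 5

# Illustrative "hand strength" scores in [0, 1] (hypothetical numbers).
HAND_STRENGTH = {
    "T-high straight": 0.82,
    "9-high straight": 0.81,  # close to the T-high straight, so same bucket
    "top pair":        0.55,
    "ace high":        0.20,
}

def bucket_of(hand: str) -> int:
    """Map a hand to one of N_BUCKETS buckets by quantizing its strength score."""
    score = HAND_STRENGTH[hand]
    return min(int(score * N_BUCKETS), N_BUCKETS - 1)

# Strategies are then stored per (bucket, betting situation) rather than per hand,
# shrinking the number of distinct situations that need a strategy.
for hand in HAND_STRENGTH:
    print(hand, "-> bucket", bucket_of(hand))
# The 10-high and 9-high straights land in the same bucket and are treated identically.
```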
Self-play through improved Monte Carlo CFR

The blueprint strategy in Pluribus was computed using a variant of counterfactual regret minimization (CFR) (29). CFR is an iterative self-play algorithm in which the AI starts by playing completely at random but gradually improves by learning to beat earlier versions of itself. Every competitive Texas hold'em AI for at least the past 6 years has computed its strategy using some variant of CFR (4–6, 23, 28, 30–34). We use a form of Monte Carlo CFR (MCCFR) that samples actions in the game tree rather than traversing the entire game tree on each iteration (33, 35–37).

On each iteration of the algorithm, MCCFR designates one player as the traverser, whose current strategy is updated on the iteration. At the start of the iteration, MCCFR simulates a hand of poker based on the current strategy of all players (which is initially completely random). Once the simulated hand is completed, the AI reviews each decision that was made by the traverser and investigates how much better or worse it would have done by choosing the other available actions instead. Next, the AI reviews each hypothetical decision that would have been made following those other actions, updating its strategy at those decision points as well; in each case, it considers what would have happened had some other action been chosen instead. This counterfactual reasoning is one of the features that distinguish CFR from other self-play algorithms that have been deployed in domains such as Go (9), Dota 2 (20), and StarCraft 2 (21).

The difference between what the traverser would have received for choosing an action versus what the traverser actually achieved (in expectation) on the iteration is added to the counterfactual regret for the action. Counterfactual regret represents how much the traverser regrets having not chosen that action in previous iterations. At the end of the iteration, the traverser's strategy is updated so that actions with higher counterfactual regret are chosen with higher probability.

For two-player zero-sum games, CFR guarantees that the average strategy played over all iterations converges to a Nash equilibrium, but convergence to a Nash equilibrium is not guaranteed outside of two-player zero-sum games.
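As a minimal concrete instance of these regret updates (our sketch, not Pluribus's actual MCCFR implementation), the following self-play loop applies regret matching, the rule by which actions with higher cumulative regret are chosen with higher probability, to one-shot Rock-Paper-Scissors. In this two-player zero-sum game, the average strategies approach the uniform Nash equilibrium.

```python
import numpy as np

PAYOFF = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])  # row player's RPS payoffs

def regret_matching(regrets):
    """Play actions with probability proportional to their positive cumulative regret."""
    positive = np.maximum(regrets, 0.0)
    total = positive.sum()
    return positive / total if total > 0 else np.full(3, 1.0 / 3.0)

regrets = [np.zeros(3), np.zeros(3)]        # cumulative counterfactual regrets per player
strategy_sums = [np.zeros(3), np.zeros(3)]  # accumulated strategies, for the average

for _ in range(100_000):
    strategies = [regret_matching(r) for r in regrets]
    for p in (0, 1):
        strategy_sums[p] += strategies[p]
    actions = [np.random.choice(3, p=s) for s in strategies]
    for p in (0, 1):
        opp = actions[1 - p]
        payoff_matrix = PAYOFF if p == 0 else -PAYOFF.T  # player 1 sees negated payoffs
        # Regret: payoff each alternative action would have received against the
        # opponent's sampled action, minus the payoff actually received.
        counterfactual = payoff_matrix[:, opp]
        regrets[p] += counterfactual - counterfactual[actions[p]]

avg = [s / s.sum() for s in strategy_sums]
print(avg[0], avg[1])  # both approach the uniform equilibrium [1/3, 1/3, 1/3]
```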
Fig. 4. Real-time search in Pluribus. The subgame shows just two players for simplicity. A dashed line between nodes indicates that the player to act does not know which of the two nodes she is in. The leaf nodes are replaced by a sequence of new nodes in which each player still in the hand chooses among k continuation strategies.
More memory and computation would enable a finer-grained blueprint that would lead to better performance but would also result in Pluribus using more memory or being slower during real-time search. We set the size of the blueprint strategy abstraction to allow Pluribus to run during live play on a machine with no more than 128 GB of memory while storing a compressed form of the blueprint strategy in memory.

Depth-limited search in imperfect-information games

The blueprint strategy for the entire game is necessarily coarse-grained owing to the size and complexity of no-limit Texas hold'em. Pluribus only plays according to this blueprint strategy in the first betting round (of four), where the number of decision points is small enough that the blueprint strategy can afford to not use information abstraction and can have a large number of actions in the action abstraction. After the first round (and even in the first round, if an opponent chooses a bet size that is sufficiently different from the sizes in the blueprint action abstraction), Pluribus instead conducts real-time search to determine a better, finer-grained strategy for the current situation it is in. For opponent bets on the first round that are slightly off the tree, Pluribus rounds the bet to a nearby on-tree size [using the pseudoharmonic mapping (39)] and proceeds to play according to the blueprint as if the opponent had used the latter bet size.
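For context, the pseudoharmonic mapping of (39) rounds an off-tree bet to one of the two neighboring on-tree sizes at random, with probabilities given by a closed-form formula. Below is a small sketch of that formula as published by Ganzfried and Sandholm, with bet sizes expressed as fractions of the pot; the example sizes are hypothetical.

```python
import random

def pseudoharmonic_prob_of_smaller(a: float, b: float, x: float) -> float:
    """Probability of rounding an off-tree bet x to the smaller on-tree size a.

    a < x < b are bet sizes expressed as fractions of the pot, following the
    pseudoharmonic action mapping of reference (39).
    """
    return ((b - x) * (1 + a)) / ((b - a) * (1 + x))

def map_to_tree(a: float, b: float, x: float) -> float:
    """Randomly round the off-tree bet x to one of the neighboring on-tree sizes."""
    return a if random.random() < pseudoharmonic_prob_of_smaller(a, b, x) else b

# Example: the abstraction contains half-pot (0.5) and full-pot (1.0) bets,
# and an opponent bets 0.75 pot.
print(pseudoharmonic_prob_of_smaller(0.5, 1.0, 0.75))  # ~0.43: maps to half-pot ~43% of the time
print(map_to_tree(0.5, 1.0, 0.75))
```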
Real-time search has been necessary for achieving superhuman performance in many perfect-information games, including backgammon (18), chess (8), and Go (9, 19). For example, when determining their next move, chess AIs commonly look some number of moves ahead, until a leaf node is reached at the depth limit of the algorithm's lookahead. An evaluation function then estimates the value of the board configuration at the leaf node if both players were to play a Nash equilibrium from that point forward. In principle, if an AI could accurately calculate the value of every leaf node (e.g., win, draw, or loss), this algorithm would choose the optimal next move.

However, search as it has been done in perfect-information games is fundamentally broken when applied to imperfect-information games. For example, consider a sequential form of Rock-Paper-Scissors, illustrated in Fig. 3, in which player 1 acts first but does not reveal her action to player 2, followed by player 2 acting. If player 1 were to conduct search that looks just one move ahead, every one of her actions would appear to lead to a leaf node with zero value. After all, if player 2 plays the Nash equilibrium strategy of choosing each action with 1/3 probability, the value to player 1 of choosing Rock is zero, as is the value of choosing Scissors. So player 1's search algorithm could choose to always play Rock because, given the values of the leaf nodes, this appears to be equally good as any other strategy.

Indeed, if player 2's strategy were fixed to always playing the Nash equilibrium, always playing Rock would be an optimal player 1 strategy. However, in reality, player 2 could adjust to a strategy of always playing Paper. In that case, the value of always playing Rock would actually be −1.
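This failure is easy to reproduce numerically. In the sketch below (our illustration of the argument above), player 1 scores each action against leaf values computed under the assumption that player 2 plays the uniform equilibrium, so every action looks equally good; evaluating the resulting always-Rock strategy against an adapted always-Paper opponent then reveals the true value of −1.

```python
import numpy as np

PAYOFF = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])  # player 1's RPS payoffs
ACTIONS = ["Rock", "Paper", "Scissors"]

# Leaf values if player 2 is assumed to play the Nash equilibrium (uniform random):
equilibrium = np.full(3, 1.0 / 3.0)
leaf_values = PAYOFF @ equilibrium
print(dict(zip(ACTIONS, leaf_values)))  # every action has value 0: search sees no difference

# A depth-limited search may therefore settle on always playing Rock...
chosen = np.array([1.0, 0.0, 0.0])

# ...but in reality player 2 can adapt, e.g. by always playing Paper:
always_paper = np.array([0.0, 1.0, 0.0])
print(chosen @ PAYOFF @ always_paper)  # -1.0: the "equally good" strategy loses outright
```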
This example illustrates that in imperfect-information subgames (the part of the game in which search is being conducted) (40), leaf nodes do not have fixed values. Instead, their values depend on the strategy that the searcher chooses in the subgame (that is, the probabilities that the searcher assigns to his actions in the subgame). In principle, this could be addressed by having the value of a subgame leaf node be a function of the searcher's strategy in the subgame, but this is impractical in large games. One alternative is to make the value of a leaf node conditional only on the belief distribution of both players at that point in the game. This was used to generate the two-player poker AI DeepStack (5). However, this option is extremely expensive because it requires one to solve huge numbers of subgames that are conditional on beliefs. It becomes even more expensive as the amount of hidden information or the number of players grows. The two-player poker AI Libratus sidestepped this issue by only doing real-time search when the remaining game was short enough that the depth limit would extend to the end of the game (6). However, as the number of players grows, always solving to the end of the game also becomes computationally prohibitive.

Pluribus instead uses a modified form of an approach that we recently designed, previously only for two-player zero-sum games (41), in which the searcher explicitly considers that any or all players may shift to different strategies beyond the leaf nodes of a subgame. Specifically, rather than assuming that all players play according to a single fixed strategy beyond the leaf nodes (which results in the leaf nodes having a single fixed value), we instead assume that each player may choose among k different strategies, specialized to each player, to play for the remainder of the game when a leaf node is reached. In the experiments in this paper, k = 4. One of the four continuation strategies that we use in the experiments is the precomputed blueprint strategy; the others are modified forms of the blueprint strategy, biased toward folding, calling, and raising, respectively.
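A highly simplified sketch of this leaf-node treatment follows (our illustration; the payoff numbers are hypothetical, and Pluribus computes these values within its search rather than by the direct max-min shown here). Instead of a single fixed leaf value, the value the searcher propagates depends on which of the k continuation strategies each player selects at the leaf.

```python
import numpy as np

K = 4  # number of continuation strategies per player, as in the paper (k = 4)

# Hypothetical leaf values for the searcher: entry [i, j] is the searcher's
# estimated payoff if the searcher continues with strategy i and the opponent
# continues with strategy j (e.g., blueprint, fold-biased, call-biased, raise-biased).
leaf_values = np.array([
    [ 0.2, -0.1,  0.3,  0.0],
    [ 0.1,  0.0, -0.2,  0.4],
    [-0.3,  0.2,  0.1, -0.1],
    [ 0.0,  0.3, -0.4,  0.2],
])

def leaf_value_for_searcher(values: np.ndarray) -> float:
    """Value the searcher propagates up from this leaf.

    Rather than a single fixed number, the value accounts for the opponent
    being free to pick whichever continuation strategy hurts the searcher most,
    while the searcher picks its own best continuation in response.
    """
    worst_case_per_choice = values.min(axis=1)  # opponent picks the worst case for us
    return worst_case_per_choice.max()          # we pick our best robust continuation

print(leaf_value_for_searcher(leaf_values))  # -0.1 under these made-up numbers
```

The effect is that a strategy which looks fine against one fixed continuation (like always-Rock against the fixed equilibrium above) is penalized if any opponent continuation can punish it.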
Experimental evaluation

We evaluated Pluribus against elite human professionals in two formats: five human professionals playing with one copy of Pluribus (5H+1AI), and one human professional playing with five copies of Pluribus (1H+5AI).
In the 5H+1AI experiment, Pluribus's win rate was determined to be profitable with a p value of 0.028. The performance of Pluribus over the course of the experiment is shown in Fig. 5. (Owing to the extremely high variance in no-limit poker and the impossibility of applying AIVAT to human players, the win rate of individual human participants could not be determined with statistical significance.)

The human participants in the 1H+5AI experiment were Chris "Jesus" Ferguson and Darren Elias. Each of the two humans separately played 5000 hands of poker against five copies of Pluribus. Pluribus does not adapt its strategy to its opponents and does not know the identity of its opponents, so the copies of Pluribus could not intentionally collude against the human player. To incentivize strong play, we offered each human $2000 for participation and an additional $2000 if he performed better against the AI than the other human player did. The players did not know who the other participant was, and they were not told how the other human was performing.

Conclusions

Most real-world strategic settings involve hidden information and more than two players, which makes them considerably more difficult both theoretically and practically. Developing a superhuman AI for multiplayer poker was a widely recognized milestone in this area and the major remaining milestone in computer poker. In this paper we described Pluribus, an AI capable of defeating elite human professionals in six-player no-limit Texas hold'em poker, the most commonly played poker format in the world. Pluribus's success shows that despite the lack of known strong theoretical guarantees on performance in multiplayer games, there are large-scale, complex multiplayer imperfect-information settings in which a carefully constructed self-play-with-search algorithm can produce superhuman strategies.

REFERENCES AND NOTES

1. D. Billings, A. Davidson, J. Schaeffer, D. Szafron, Artif. Intell. 134, 201–240 (2002).
28. N. Brown, S. Ganzfried, T. Sandholm, in International Conference on Autonomous Agents and Multiagent Systems (AAMAS) (2015), pp. 7–15.
29. M. Zinkevich, M. Johanson, M. H. Bowling, C. Piccione, in Neural Information Processing Systems (NeurIPS) (2007), pp. 1729–1736.
30. E. G. Jackson, in AAAI Workshop on Computer Poker and Imperfect Information (2013).
31. M. B. Johanson, "Robust strategies and counter-strategies: from superhuman to optimal play," Ph.D. thesis, University of Alberta (2016).
32. E. G. Jackson, in AAAI Workshop on Computer Poker and Imperfect Information (2016).
33. N. Brown, T. Sandholm, in International Joint Conference on Artificial Intelligence (IJCAI) (2016), pp. 4238–4239.
34. E. G. Jackson, in AAAI Workshop on Computer Poker and Imperfect Information Games (2017).
35. M. Lanctot, K. Waugh, M. Zinkevich, M. Bowling, in Neural Information Processing Systems (NeurIPS) (2009), pp. 1078–1086.
36. M. Johanson, N. Bard, M. Lanctot, R. Gibson, M. Bowling, in International Conference on Autonomous Agents and Multiagent Systems (AAMAS) (2012), pp. 837–846.
37. R. Gibson, M. Lanctot, N. Burch, D. Szafron, M. Bowling, in AAAI Conference on Artificial Intelligence (AAAI) (2012).
SUPPLEMENTARY MATERIALS
http://science.sciencemag.org/content/suppl/2019/07/10/science.aay2400.DC1