ReBeL - Combining Deep Reinforcement Learning and Search for Imperfect-Information Games (Explained)

Published 2020-12-16
#ai #technology #poker

This paper does for Poker what AlphaZero has done for Chess & Go. The combination of self-play reinforcement learning and tree search has had tremendous success in perfect-information games, but transferring such techniques to imperfect-information games is a hard problem. Not only does ReBeL solve this problem, it also provably converges to a Nash equilibrium and delivers a superhuman Heads-Up No-Limit Hold'em bot with very little domain knowledge.

OUTLINE:
0:00 - Intro & Overview
3:20 - Rock, Paper, and Double Scissor
10:00 - AlphaZero Tree Search
18:30 - Notation Setup: Infostates & Nash Equilibria
31:45 - One Card Poker: Introducing Belief Representations
45:00 - Solving Games in Belief Representation
55:20 - The ReBeL Algorithm
1:04:00 - Theory & Experiment Results
1:07:00 - Broader Impact
1:10:20 - High-Level Summary

Paper: arxiv.org/abs/2007.13544
Code: github.com/facebookresearch/rebel
Blog: ai.facebook.com/blog/rebel-a-general-game-playing-…

ERRATA: As someone pointed out on the last video: this is not the best Poker algorithm, but the best one that uses very little expert knowledge.

Abstract:
The combination of deep reinforcement learning and search at both training and test time is a powerful paradigm that has led to a number of successes in single-agent settings and perfect-information games, best exemplified by AlphaZero. However, prior algorithms of this form cannot cope with imperfect-information games. This paper presents ReBeL, a general framework for self-play reinforcement learning and search that provably converges to a Nash equilibrium in any two-player zero-sum game. In the simpler setting of perfect-information games, ReBeL reduces to an algorithm similar to AlphaZero. Results in two different imperfect-information games show ReBeL converges to an approximate Nash equilibrium. We also show ReBeL achieves superhuman performance in heads-up no-limit Texas hold'em poker, while using far less domain knowledge than any prior poker AI.
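
As a concrete illustration of the abstract's "self-play converges to an approximate Nash equilibrium" claim, consider the video's opening example (3:20): rock-paper-scissors where any win involving scissors counts double, whose equilibrium is to play rock and paper with probability 0.4 each and scissors with 0.2. The sketch below is not ReBeL itself (ReBeL searches over public belief states with a learned value network); it is a minimal NumPy demo of regret matching, the local update that CFR-style solvers build on, assuming the payoff convention just described. Its average strategy converges to that equilibrium.

import numpy as np

# Payoff to player 1 in "double scissors" rock-paper-scissors:
# any win involving scissors is worth 2 instead of 1. Order: R, P, S.
A = np.array([[ 0., -1.,  2.],
              [ 1.,  0., -2.],
              [-2.,  2.,  0.]])

def regret_matching(regrets):
    # Play each action in proportion to its positive cumulative regret
    # (uniform if no action has positive regret).
    pos = np.maximum(regrets, 0.0)
    return pos / pos.sum() if pos.sum() > 0 else np.full(3, 1.0 / 3.0)

regrets = [np.zeros(3), np.zeros(3)]        # cumulative regrets per player
strategy_sum = [np.zeros(3), np.zeros(3)]   # running sums for the average strategy

for _ in range(100_000):
    s1, s2 = regret_matching(regrets[0]), regret_matching(regrets[1])
    strategy_sum[0] += s1
    strategy_sum[1] += s2
    u1 = A @ s2        # player 1's payoff for each pure action vs. player 2's mix
    u2 = -(A.T @ s1)   # player 2's payoff for each pure action (zero-sum game)
    regrets[0] += u1 - s1 @ u1   # regret = action payoff minus current mix's payoff
    regrets[1] += u2 - s2 @ u2

print("average strategy (R, P, S):", np.round(strategy_sum[0] / strategy_sum[0].sum(), 3))
# prints approximately [0.4 0.4 0.2]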

Authors: Noam Brown, Anton Bakhtin, Adam Lerer, Qucheng Gong

Links:
YouTube: youtube.com/c/yannickilcher
Twitter: twitter.com/ykilcher
Discord: discord.gg/4H8xxDF
BitChute: www.bitchute.com/channel/yannic-kilcher
Minds: www.minds.com/ykilcher
Parler: parler.com/profile/YannicKilcher
LinkedIn: www.linkedin.com/in/yannic-kilcher-488534136/

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannickilcher
Patreon: www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

All Comments (21)
  • @jahcane3711
    I love the eye you drew, Yannic. Great review, thanks
  • @oncedidactic
    That high-level summary section at the end is super fun once you've run through the preceding explanation; it's like a Tony Hawk perfect run. XD
  • @herstar9510
    This was simultaneously interesting and I couldn't understand it at the same time.
  • @binjianxin7830
    A POMDP is not an MDP. The infostate is not a sufficient state of the history. The whole trick is to transform the infostate formulation into a sufficient one. That's why the early strategy of the opponent needs to be considered, hence the added complexity and computation. This could have been clarified in the first place.
  • @AP-dc1ks
    Self play :| Imperfect information :) Provably converges :O
  • @robindebreuil
    Lisa: Poor predictable Bart, always takes rock. Bart: ROCK! NOTHING BEATS ROCK!
  • @23kl104
    I like the drawing of your eye, it's got some sinister watchfulness
  • Cool paper. IMHO the r-p-s example is a bit underwhelming. There is a fundamental difference between games like chess and rock-paper-scissors: in chess the moves in a sequence are far more dependent on each other, whereas in r-p-s they are essentially independent, just like 50 throws of a coin are independent of each other. Please correct me if I am wrong. Assuming the results of every game are independent of each other (a game consists of one move from each player), picking rock is more potent for p1 (outcomes: 0, -1, +2). Then, upon observing p1 picking rock often, p2 can up the probability of picking paper, and so on. Eventually, though, it should come to the same (0.4, 0.4, 0.2) equilibrium, starting from something like (1.0, 0.0, 0.0) instead of the authors' suggestion of something like (0.0, 0.0, 1.0). [A quick exploitability check of this equilibrium appears after the comments.]
  • Dreaming of a PyTorch / TensorFlow implementation on github for christmas...
  • @JTMoustache
    To understand the first part of this video, it is useful to look at the definitions of a Nash equilibrium and a Pareto-optimal strategy. The Coursera Game Theory course (weeks 1 and 2) is great for understanding those concepts.
  • @XOPOIIIO
    Can somebody tell me where you get the latest ML news about developments in the field, the latest research papers being discussed, etc.?
  • @jahcane3711
    Did you mean it when you said this takes heaps of compute and not to try this at home? By that I mean, does it really take THAT much compute?
  • @herp_derpingson
    I still don't fully understand the architecture. Are we differentiating through the CFR? What is the loss function?
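
Picking up the modified rock-paper-scissors thread from a few comments up: a quick way to check the (0.4, 0.4, 0.2) claim is to compute exploitability, i.e. how much a best-responding opponent can win against a fixed strategy; exploitability is also the metric the paper uses to measure convergence. A minimal check, assuming the same payoff convention as the sketch under the abstract (wins involving scissors count double):

import numpy as np

# Payoff to player 1; rows = player 1's action, columns = player 2's action (R, P, S).
A = np.array([[ 0., -1.,  2.],
              [ 1.,  0., -2.],
              [-2.,  2.,  0.]])

def exploitability(x):
    # The game is symmetric with value 0, so this measures how far below 0
    # a best-responding opponent can push player 1's expected payoff when
    # player 1 commits to the mixed strategy x.
    return 0.0 - np.min(x @ A)

print(exploitability(np.array([0.4, 0.4, 0.2])))  # 0.0 -> no profitable deviation, so it is an equilibrium
print(exploitability(np.array([1.0, 0.0, 0.0])))  # 1.0 -> "always rock" loses 1 per round to paper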