ReBeL - Combining Deep Reinforcement Learning and Search for Imperfect-Information Games (Explained)

Published 2020-12-16
#ai #technology #poker

This paper does for Poker what AlphaZero has done for Chess & Go. The combination of self-play reinforcement learning and tree search has had tremendous success in perfect-information games, but transferring such techniques to imperfect-information games is a hard problem. Not only does ReBeL solve this problem, it also provably converges to a Nash equilibrium and delivers a superhuman Heads-Up No-Limit Hold'em bot with very little domain knowledge.

OUTLINE:
0:00 - Intro & Overview
3:20 - Rock, Paper, and Double Scissor
10:00 - AlphaZero Tree Search
18:30 - Notation Setup: Infostates & Nash Equilibria
31:45 - One Card Poker: Introducing Belief Representations
45:00 - Solving Games in Belief Representation
55:20 - The ReBeL Algorithm
1:04:00 - Theory & Experiment Results
1:07:00 - Broader Impact
1:10:20 - High-Level Summary

Paper: arxiv.org/abs/2007.13544
Code: github.com/facebookresearch/rebel
Blog: ai.facebook.com/blog/rebel-a-general-game-playing-…

ERRATA: As someone pointed out on the last video: this is not the best Poker algorithm, but the best one that uses very little expert knowledge.

Abstract:
The combination of deep reinforcement learning and search at both training and test time is a powerful paradigm that has led to a number of successes in single-agent settings and perfect-information games, best exemplified by AlphaZero. However, prior algorithms of this form cannot cope with imperfect-information games. This paper presents ReBeL, a general framework for self-play reinforcement learning and search that provably converges to a Nash equilibrium in any two-player zero-sum game. In the simpler setting of perfect-information games, ReBeL reduces to an algorithm similar to AlphaZero. Results in two different imperfect-information games show ReBeL converges to an approximate Nash equilibrium. We also show ReBeL achieves superhuman performance in heads-up no-limit Texas hold'em poker, while using far less domain knowledge than any prior poker AI.
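
As a concrete illustration of the abstract's "self-play converges to an approximate Nash equilibrium" claim, consider the video's opening example (3:20): rock-paper-scissors where any win involving scissors counts double, whose equilibrium is to play rock and paper with probability 0.4 each and scissors with 0.2. The sketch below is not ReBeL itself (ReBeL searches over public belief states with a learned value network); it is a minimal NumPy demo of regret matching, the local update that CFR-style solvers build on, assuming the payoff convention just described. Its average strategy converges to that equilibrium.

import numpy as np

# Payoff to player 1 in "double scissors" rock-paper-scissors:
# any win involving scissors is worth 2 instead of 1. Order: R, P, S.
A = np.array([[ 0., -1.,  2.],
              [ 1.,  0., -2.],
              [-2.,  2.,  0.]])

def regret_matching(regrets):
    # Play each action in proportion to its positive cumulative regret
    # (uniform if no action has positive regret).
    pos = np.maximum(regrets, 0.0)
    return pos / pos.sum() if pos.sum() > 0 else np.full(3, 1.0 / 3.0)

regrets = [np.zeros(3), np.zeros(3)]        # cumulative regrets per player
strategy_sum = [np.zeros(3), np.zeros(3)]   # running sums for the average strategy

for _ in range(100_000):
    s1, s2 = regret_matching(regrets[0]), regret_matching(regrets[1])
    strategy_sum[0] += s1
    strategy_sum[1] += s2
    u1 = A @ s2        # player 1's payoff for each pure action vs. player 2's mix
    u2 = -(A.T @ s1)   # player 2's payoff for each pure action (zero-sum game)
    regrets[0] += u1 - s1 @ u1   # regret = action payoff minus current mix's payoff
    regrets[1] += u2 - s2 @ u2

print("average strategy (R, P, S):", np.round(strategy_sum[0] / strategy_sum[0].sum(), 3))
# prints approximately [0.4 0.4 0.2]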

Authors: Noam Brown, Anton Bakhtin, Adam Lerer, Qucheng Gong

Links:
YouTube: youtube.com/c/yannickilcher
Twitter: twitter.com/ykilcher
Discord: discord.gg/4H8xxDF
BitChute: www.bitchute.com/channel/yannic-kilcher
Minds: www.minds.com/ykilcher
Parler: parler.com/profile/YannicKilcher
LinkedIn: www.linkedin.com/in/yannic-kilcher-488534136/

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannickilcher
Patreon: www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

All Comments (21)
  • @jahcane3711
    I love the eye you drew, Yannic. Great review, thanks
  • @oncedidactic
    That high-level summary section at the end is super fun once you've run through the preceding explanation; it's like a Tony Hawk perfect run. XD
  • @herstar9510
    This was simultaneously interesting and I couldn't understand it at the same time.
  • @binjianxin7830
    A POMDP is not an MDP. The infostate is not a sufficient state of the history. The whole trick is to transform the infostate formulation into a sufficient one. That's why the early strategy of the opponent needs to be considered, hence the added complexity and computation. This could have been clarified in the first place.
  • @AP-dc1ks
    Self play :| Imperfect information :) Provably converges :O
  • @robindebreuil
    Lisa: Poor predictable Bart, always takes rock. Bart: ROCK! NOTHING BEATS ROCK!
  • @23kl104
    I like the drawing of your eye, it's got some sinister watchfulness
  • Cool paper. IMHO the r-p-s example is a bit underwhelming. There is a fundamental difference between games like chess and rock-paper-scissors: in chess the moves in a sequence are far more dependent on each other, whereas in r-p-s they are essentially independent, just like 50 throws of a coin are independent of each other. Please correct me if I am wrong. Assuming the results of every game are independent of each other (a game consists of one move from each player), picking rock is more potent for p1 (outcomes: 0, -1, +2). Then, upon observing p1 picking rock often, p2 can up the probability of picking paper, and so on. Eventually, though, it should come to the same (0.4, 0.4, 0.2) equilibrium, starting from something like (1.0, 0.0, 0.0) instead of the authors' suggestion of something like (0.0, 0.0, 1.0). [A quick exploitability check of this equilibrium appears after the comments.]
  • Dreaming of a PyTorch / TensorFlow implementation on github for christmas...
  • @JTMoustache
    To understand the first part of this video, it is useful to look at the definitions of a Nash equilibrium and a Pareto-optimal strategy. The Coursera Game Theory course (weeks 1 and 2) is great for understanding those concepts.
  • @XOPOIIIO
    Can somebody tell me where you get the latest ML news about developments in the field, the latest research papers being discussed, etc.?
  • @jahcane3711
    Did you mean it when you said this takes heaps of compute and not to try this at home? By that I mean, does it really take THAT much compute?
  • @herp_derpingson
    I still don't fully understand the architecture. Are we differentiating through the CFR? What is the loss function?
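
Picking up the modified rock-paper-scissors thread from a few comments up: a quick way to check the (0.4, 0.4, 0.2) claim is to compute exploitability, i.e. how much a best-responding opponent can win against a fixed strategy; exploitability is also the metric the paper uses to measure convergence. A minimal check, assuming the same payoff convention as the sketch under the abstract (wins involving scissors count double):

import numpy as np

# Payoff to player 1; rows = player 1's action, columns = player 2's action (R, P, S).
A = np.array([[ 0., -1.,  2.],
              [ 1.,  0., -2.],
              [-2.,  2.,  0.]])

def exploitability(x):
    # The game is symmetric with value 0, so this measures how far below 0
    # a best-responding opponent can push player 1's expected payoff when
    # player 1 commits to the mixed strategy x.
    return 0.0 - np.min(x @ A)

print(exploitability(np.array([0.4, 0.4, 0.2])))  # 0.0 -> no profitable deviation, so it is an equilibrium
print(exploitability(np.array([1.0, 0.0, 0.0])))  # 1.0 -> "always rock" loses 1 per round to paper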