AlphaGo Zero: Mastering the game of Go without human knowledge
This post is part of my DeepMind and OpenAI papers series. It is a summary of the AlphaGo-Zero paper from DeepMind. The paper trains a Go program with self-play reinforcement learning, showing that pure reinforcement learning, starting from random weights and without any human demonstrations, can reach superhuman performance in the game of Go, beating the earlier AlphaGo algorithm 100–0.
There are several differences between AlphaGo-Zero and AlphaGo:
- It is trained only using self-play RL without any supervision and human data.
- It uses only the black and white stones from the board as input features.
- It uses one single neural network, rather than separate policy and value networks.
- It uses a simpler tree search that relies on this single neural network to evaluate positions and sample moves, without Monte Carlo rollouts.
The single neural network, f_theta, takes the raw board representation, s, and its history as input, and outputs move probabilities and a value, p and v → (p, v) = f_theta(s), where v estimates the probability that the current player wins the game from position s.
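To make the two-headed design concrete, here is a minimal numpy sketch of such a network: a shared trunk feeding a softmax policy head and a tanh value head. The layer sizes, random weights, and flattened input encoding are all illustrative assumptions, not the paper's actual residual-network architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for illustration: 17 history planes of a 19x19 board, flattened.
BOARD_PLANES = 17 * 19 * 19
HIDDEN = 64
MOVES = 19 * 19 + 1  # every board point plus a pass move

# Randomly initialised weights stand in for the trained parameters theta.
W_trunk = rng.normal(0, 0.01, (BOARD_PLANES, HIDDEN))
W_policy = rng.normal(0, 0.01, (HIDDEN, MOVES))
W_value = rng.normal(0, 0.01, (HIDDEN, 1))

def f_theta(s):
    """Map a raw board encoding s to (p, v).

    p: move probabilities (softmax over all points plus pass).
    v: scalar in [-1, 1], the predicted outcome for the current player.
    """
    h = np.tanh(s @ W_trunk)            # shared trunk: one network, two heads
    logits = h @ W_policy
    p = np.exp(logits - logits.max())   # policy head: numerically stable softmax
    p /= p.sum()
    v = np.tanh(h @ W_value)[0]         # value head: tanh squashes into [-1, 1]
    return p, v

s = rng.normal(size=BOARD_PLANES)       # a fake board encoding for demonstration
p, v = f_theta(s)
```

The shared trunk is the point: policy and value are trained jointly from the same features, rather than by two separate networks as in the original AlphaGo.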
This neural net is trained using self-play RL. The following figure shows the self-play algorithm used in AlphaGo-Zero:
- First, the program plays a game s_1, …, s_T against itself.
- It uses alpha_theta, an MCTS-based policy driven by the latest f_theta, to select a move in each state s_t, continuing the game until the terminal state s_T to determine the winner z.
- Then the neural network parameters are updated by minimizing the error between z and v, while maximizing the similarity between p and the search probabilities produced by alpha_theta.
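The update step above corresponds to the paper's combined loss: squared error on the value plus cross-entropy between the search probabilities and the policy head, with an L2 regularizer. A small sketch of that loss for a single position (the epsilon for log stability and the default regularization weight are my own assumptions):

```python
import numpy as np

def alphago_zero_loss(p, v, pi, z, theta=None, c=1e-4):
    """l = (z - v)^2 - pi . log p (+ c * ||theta||^2), for one position.

    p, v  : network outputs for this position
    pi    : MCTS visit-count distribution (the search probabilities)
    z     : final game outcome from this player's perspective (+1 or -1)
    theta : optional flat parameter vector for the L2 term
    """
    value_loss = (z - v) ** 2                       # pull v toward the outcome z
    policy_loss = -np.sum(pi * np.log(p + 1e-10))   # pull p toward pi
    l2 = c * np.sum(theta ** 2) if theta is not None else 0.0
    return value_loss + policy_loss + l2

# Example: uniform policy over 3 moves, search concentrated on the first move.
loss = alphago_zero_loss(p=np.full(3, 1 / 3), v=0.0,
                         pi=np.array([1.0, 0.0, 0.0]), z=1.0)
```

Minimizing the cross-entropy term is what "maximizing the similarity" between p and the search probabilities amounts to in practice.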
The following figure shows the MCTS algorithm used in AlphaGo-Zero:
- In each state s, the edge with the highest action value plus upper confidence bound is selected; the bound depends on a stored prior probability p and visit count N for that edge.
- The search continues until it reaches a leaf node s_L, which is expanded using the neural network f_theta to estimate p and v; p is stored on the outgoing edges of s_L.
- Then the visit counts and action values are updated along the traversed path, so each action value tracks the mean evaluation of the subtree below that action.
- Finally, the updated search statistics are used to select the actual move to play.
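The selection and backup steps above can be sketched as follows. The PUCT-style selection rule (Q plus an exploration bonus from the prior and visit counts) matches the paper's description; the exploration constant value and the `+ 1` under the square root, which lets priors break ties on the first visit, are implementation assumptions of this sketch.

```python
import math

class Edge:
    """Per-edge statistics stored in the search tree:
    N (visit count), W (total value), Q (mean value), P (prior from f_theta)."""
    def __init__(self, prior):
        self.N, self.W, self.Q, self.P = 0, 0.0, 0.0, prior

C_PUCT = 1.0  # exploration constant; an assumed value, tuned in practice

def select(edges):
    """Pick the action maximising Q + U over a dict of action -> Edge."""
    total_n = sum(e.N for e in edges.values())
    def score(e):
        u = C_PUCT * e.P * math.sqrt(total_n + 1) / (1 + e.N)
        return e.Q + u
    return max(edges, key=lambda a: score(edges[a]))

def backup(path, v):
    """Update N, W, Q for every edge on the traversed path with the leaf
    value v, flipping its sign at each ply since the players alternate."""
    for edge in reversed(path):
        edge.N += 1
        edge.W += v
        edge.Q = edge.W / edge.N
        v = -v

# Example: two candidate moves; before any visits, the prior decides.
edges = {"a": Edge(0.6), "b": Edge(0.4)}
best = select(edges)            # "a", the higher-prior move
backup([edges[best]], 1.0)      # a win propagated back along the path
```

Because each edge's Q is the running mean of the evaluations backed up through it, repeated simulations gradually shift selection from the network's priors toward empirically good moves.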
This was a brief overview of the paper. For more details about the method, training, and the different experiments, please refer to the original paper.