AlphaGo Zero: Mastering the game of Go without human knowledge


This post is part of my DeepMind and OpenAI papers series. It is a summary of the AlphaGo Zero paper from DeepMind. The paper uses self-play reinforcement learning to learn Go, and shows that pure reinforcement learning, starting from random weights and without any human demonstrations, can achieve superhuman performance in the game of Go, winning 100–0 against the previously published AlphaGo.

There are several differences between AlphaGo-Zero and AlphaGo:

  • It is trained only using self-play RL without any supervision and human data.
  • It uses only the black and white stones from the board as input features.
  • It uses one single neural network, rather than separate policy and value networks.
  • It uses a simpler tree search that relies on this single neural network to evaluate positions and sample moves, without Monte Carlo rollouts.

The single neural net, f_theta, takes the raw board representation s and its history as input, and outputs move probabilities p and a value v: (p, v) = f_theta(s)

Here v estimates the probability of the current player winning the game from position s.
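To make the (p, v) = f_theta(s) interface concrete, here is a minimal NumPy sketch. The real model is a deep residual convolutional network; this toy linear version (names like W_p, w_v are illustrative, not from the paper) only shows the input/output contract:

```python
import numpy as np

BOARD_SIZE = 19
NUM_MOVES = BOARD_SIZE * BOARD_SIZE + 1  # every board point plus the pass move

def f_theta(s, theta):
    """Toy stand-in for the network: maps a flattened board representation s
    to move probabilities p (summing to 1) and a value v in [-1, 1]."""
    logits = theta["W_p"] @ s + theta["b_p"]
    exp = np.exp(logits - logits.max())           # numerically stable softmax
    p = exp / exp.sum()                           # move probabilities
    v = np.tanh(theta["w_v"] @ s + theta["b_v"])  # scalar value estimate
    return p, v

# Example with random parameters and a random flattened board input
rng = np.random.default_rng(0)
s = rng.normal(size=BOARD_SIZE * BOARD_SIZE)
theta = {
    "W_p": rng.normal(size=(NUM_MOVES, s.size)) * 0.01,
    "b_p": np.zeros(NUM_MOVES),
    "w_v": rng.normal(size=s.size) * 0.01,
    "b_v": 0.0,
}
p, v = f_theta(s, theta)
```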

This neural net is trained using self-play RL. The following figure shows the self-play algorithm used in AlphaGo-Zero:

  • First, the program plays a game s_1, …, s_T against itself.
  • It uses alpha_theta, an MCTS policy based on the latest f_theta, to select a move in each state s_t, continues the game until the terminal state s_T, and determines the winner z.
  • Then the neural-network parameters are updated by minimizing the error between z and v, while maximizing the similarity between the search probabilities produced by alpha_theta and p.
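The update in the last step corresponds to the paper's combined loss, l = (z − v)² − π·log p + c‖θ‖². A small NumPy sketch of that loss (the L2 term is passed in as a precomputed scalar for brevity):

```python
import numpy as np

def alphago_zero_loss(p, v, pi, z, theta_sq_norm=0.0, c=1e-4):
    """Combined loss: squared value error between outcome z and prediction v,
    cross-entropy between the MCTS search policy pi and the network policy p,
    plus L2 regularisation on the parameters."""
    value_loss = (z - v) ** 2
    policy_loss = -np.sum(pi * np.log(p + 1e-12))  # -pi . log p
    return value_loss + policy_loss + c * theta_sq_norm

# A network that matches the search policy and the game outcome
# incurs a lower loss than one that disagrees with both.
pi = np.array([0.7, 0.2, 0.1])
loss_good = alphago_zero_loss(p=pi, v=1.0, pi=pi, z=1.0)
loss_bad = alphago_zero_loss(p=np.array([0.1, 0.2, 0.7]), v=-1.0, pi=pi, z=1.0)
```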

The following figure shows the MCTS algorithm used in AlphaGo-Zero:

  • In each state s, the edge with the highest action value plus upper confidence bound, which depends on a stored prior probability p and the visit count N of that edge, is selected.
  • The search continues until it reaches a leaf node s_L, which is expanded using the neural net f_theta to estimate p and v; the prior p is stored on the outgoing edges of s_L.
  • Then the visit count and action value are updated for the subtree below the selected action at the root node.
  • Finally, the updated MCTS visit counts are used to select the move that is actually played.
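The selection step above can be sketched as a small PUCT rule: pick the child maximising Q(s, a) + U(s, a), where U grows with the stored prior and shrinks with the visit count. The data layout (a dict per edge with N, W, P) is an illustrative assumption, not the paper's pseudocode:

```python
import math

def select_child(children, c_puct=1.0):
    """PUCT edge selection: argmax over Q + U, where Q = W / N is the mean
    action value and U = c_puct * P * sqrt(sum N) / (1 + N) is the
    prior-weighted exploration bonus."""
    total_n = sum(ch["N"] for ch in children.values())
    best_a, best_score = None, -float("inf")
    for a, ch in children.items():
        q = ch["W"] / ch["N"] if ch["N"] > 0 else 0.0
        u = c_puct * ch["P"] * math.sqrt(total_n) / (1 + ch["N"])
        if q + u > best_score:
            best_a, best_score = a, q + u
    return best_a

# An unvisited move with a high prior can outrank a well-visited one,
# which is how the stored network prior steers exploration.
children = {
    "a": {"N": 10, "W": 4.0, "P": 0.2},  # Q = 0.4, small U
    "b": {"N": 0, "W": 0.0, "P": 0.6},   # Q = 0, large prior-driven U
}
chosen = select_child(children)
```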

This was a brief review of the paper. For more details about the method, training, and the different experiments, please refer to the original paper.