I’m Gergely, 22, and I finished my Bachelor’s in Computer Science not long ago. A few weeks ago my friend Gergely Halacsy and I re-implemented the AlphaZero algorithm, using only very simple tools, just enough to make it work. In this post I’d like to talk about what I experienced, and how the implementation changed my thoughts about AlphaZero.
Idea and experience
Don’t implement such an algorithm without standard Q-learning knowledge!
Just a quick note on this. Q-learning was part of my university’s curriculum, so I had learnt about it but never used it. Although I understood nearly everything written in DeepMind’s papers, I got lost in the implementation. So I decided to take a step back, went through Arthur Juliani’s guide on deep Q-learning, did some coding, and then returned to the AlphaZero project with a clear mind.
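For readers who want the same refresher, the core of Q-learning is a single update rule. The sketch below is a minimal tabular version; the state/action names and hyperparameter values are illustrative assumptions, not taken from our project or Juliani’s tutorial.

```python
from collections import defaultdict

ALPHA = 0.1   # learning rate (assumed value)
GAMMA = 0.99  # discount factor (assumed value)

# Q[(state, action)] -> current value estimate, defaulting to 0.0
Q = defaultdict(float)

def q_update(state, action, reward, next_state, actions):
    """One Q-learning step: nudge Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + GAMMA * best_next
    Q[(state, action)] += ALPHA * (td_target - Q[(state, action)])
```

Once this update feels natural, the jump to deep Q-learning is mostly a matter of replacing the table with a network and adding a replay memory.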
In the beginning I was excited about training a neural net purely from self-play. The engine studies a game, and it uses that knowledge to play exactly the same game. For that reason I believed overfitting was not possible in this setting, but it is!
Since I didn’t have a GPU, my first thought was that I could simplify the algorithm by reducing the number of episodes, the number of simulations, and the memory size used in the original AlphaZero. I was right, although I didn’t think through what happens if these numbers drop below the essential minimum.
If the number of episodes is low, or the memory is not large enough to cover the last couple of iterations, the next generation of the network cannot learn the general concept of the game; it can only memorize a few moves. So when we match the candidate next-generation network against the current best one, a single unfamiliar move can confuse the challenger, and it will never find its way to victory.
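The “memory covering the last couple of iterations” is just a fixed-size buffer of self-play positions. A minimal sketch, assuming positions are stored as `(state, policy_target, outcome)` tuples (the names and the buffer size are illustrative, not the repo’s actual values):

```python
from collections import deque

# Must be large enough to span the last several self-play iterations,
# or the next generation only sees (and overfits to) a handful of games.
MEMORY_SIZE = 30000  # assumed value for illustration

memory = deque(maxlen=MEMORY_SIZE)  # oldest samples fall out automatically

def store_game(positions, outcome):
    """Append every position of a finished self-play game to the memory."""
    for state, policy_target in positions:
        memory.append((state, policy_target, outcome))
```

The `deque(maxlen=...)` keeps the buffer bounded without any manual eviction logic; the training step then samples mini-batches from it.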
From another point of view, we can also overfit the game if the number of simulations is too low. Early in training, the Dirichlet noise can make the engine prioritize certain moves over others. Consequently, many possible moves remain unvisited, and the policy network learns that they should never be visited again from that state. The number of simulations must be large enough to give a good target, i.e. reasonable visit statistics for the next generation’s policy network, but at the same time low enough that the engine can play out games quickly.
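The two mechanisms at play here can be sketched in a few lines: mixing Dirichlet noise into the root priors for exploration, and turning the resulting visit counts into the policy network’s training target. The parameter values below follow the style of the AlphaZero paper, but the exact numbers are assumptions for illustration.

```python
import numpy as np

def noisy_root_priors(priors, eps=0.25, alpha=0.3):
    """Mix Dirichlet noise into the root move priors for exploration."""
    noise = np.random.dirichlet([alpha] * len(priors))
    return (1 - eps) * np.asarray(priors) + eps * noise

def policy_target(visit_counts, temperature=1.0):
    """Turn MCTS visit counts into a probability target for the policy net."""
    counts = np.asarray(visit_counts, dtype=np.float64) ** (1.0 / temperature)
    return counts / counts.sum()
```

With too few simulations, the noise dominates the counts: moves the noise happened to skip get a target probability near zero, and the policy network dutifully learns never to try them.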
It sounds nice that AlphaZero needs only 4 TPUs to play, compared to AlphaGo’s massive computational requirements. On the other hand, DeepMind used far more resources to train the engine, and people tend to forget that.
If any of you want to implement AlphaZero at home, you should really stick to tiny games with a small action space; otherwise your computer would need hundreds, maybe thousands, of years to learn the game well.
We ran the algorithm on the well-known game Connect-4. The engine started from scratch and trained for 60 hours on an Intel® Core™ 2 Quad 2.40 GHz CPU. Interestingly, as the engine explored the game, it picked up different trends and openings.
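Connect-4 is a good fit precisely because its action space is tiny: at most 7 legal moves per position. A toy sketch of the state representation, assuming the board is stored as a list of column stacks (this is an illustrative encoding, not necessarily the one in our repo):

```python
ROWS, COLS = 6, 7  # standard Connect-4 dimensions

def new_board():
    """Empty board: each column is a stack that pieces drop onto."""
    return [[] for _ in range(COLS)]

def legal_moves(board):
    """Columns that are not yet full."""
    return [c for c in range(COLS) if len(board[c]) < ROWS]

def drop(board, col, player):
    """Drop a piece for `player` (+1 or -1) into column `col`."""
    board[col].append(player)
```

Compare this with chess or Go, where the branching factor alone makes home-scale self-play hopeless.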
To test its real performance, we took the 24-hour generation of the engine, increased the number of simulations per move from 100 to 1,200, and played it against our colleagues, a group of data scientists. That generation had acquired enough knowledge to beat the human team 5 to 1.
Although DeepMind’s algorithm is quite fascinating in itself, in the future we’re planning to try out some possible improvements that might speed up the training.
Further details on these improvements can be found in the git repo. Unfortunately we have not yet prepared any quick-run scripts for curious developers, but it’s on the to-do list.
Source: Deep Learning on Medium