Reinforcement Learning Adventures


The network was once again a Deep Q-Network, but with a twist. DQNs tend to be unstable because the target they are trying to learn keeps changing: the same network produces both the predictions and the targets those predictions are measured against. You can think of this as a case of moving goalposts. Every time you get closer, the goalpost moves further away.

To help alleviate this instability, van Hasselt et al. (arXiv:1509.06461 [cs.LG]) proposed Double DQN in 2015: keep two networks, one "online" and one "target". The online network trains against a goalpost provided by the fixed target network, and every n moves the target network is refreshed with the online network's weights.
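Here is a minimal sketch of that update, assuming a PyTorch setup; the function names and batch tensors are mine, not the article's. Following the cited paper, the online network picks the next action, the frozen target network scores it, and the target network is refreshed every n steps:

```python
import torch

def double_dqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Bootstrapped targets: the online net chooses the next action,
    the frozen target net evaluates it (Double DQN, arXiv:1509.06461)."""
    with torch.no_grad():
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        return rewards + gamma * (1.0 - dones) * next_q

def sync_target(online_net, target_net):
    """Every n moves, move the 'goalpost': copy the online weights into the target net."""
    target_net.load_state_dict(online_net.state_dict())
```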

The network was two convolutional layers followed by three dense layers. The input was the current game state as a 6-by-7 array, with zero meaning empty, -1 meaning black, and 1 meaning red. It had 7 outputs, one for each column. At the end, an action mask was applied to prevent the network from playing illegal moves (such as dropping into a column that is already full).
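A rough reconstruction of that architecture in PyTorch might look like the following; the filter counts, kernel sizes, and hidden-layer widths are my guesses, since the article only gives the layer counts, input shape, and output size:

```python
import torch
import torch.nn as nn

class Connect4DQN(nn.Module):
    """Rough reconstruction: two conv layers, three dense layers, 7 Q-values.
    Filter counts and layer widths are illustrative, not the article's."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.dense = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 6 * 7, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 7),              # one Q-value per column
        )

    def forward(self, board, legal_mask):
        # board: (batch, 1, 6, 7) with 0 = empty, -1 = black, 1 = red
        # legal_mask: (batch, 7) boolean, True where the column is not full
        q = self.dense(self.conv(board))
        return q.masked_fill(~legal_mask, float("-inf"))  # action mask
```

The mask sends the Q-values of full columns to negative infinity, so an argmax over the output can never select an illegal move.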

When training, the network plays against itself repeatedly, getting a positive reward when it wins a game and a negative reward when it loses. Every 1000 moves, it also plays 100 games against a random player (one that picks moves randomly) to calculate its win-rate.
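The evaluation step might look something like this sketch; the environment and agent interfaces used here (reset, legal_moves, step, winner, act) are hypothetical stand-ins, not the article's code:

```python
import random

def evaluate_vs_random(make_env, agent, n_games=100):
    """Play n_games against an opponent that picks uniformly among legal
    moves and return the agent's win-rate. Sides alternate between games.
    The env/agent interface here is assumed for illustration."""
    wins = 0
    for game in range(n_games):
        env = make_env()
        env.reset()
        agent_side = 1 if game % 2 == 0 else -1   # +1 = red, -1 = black (assumed)
        side_to_move = 1
        while not env.done():
            legal = env.legal_moves()
            if side_to_move == agent_side:
                move = agent.act(env.observation(), legal)
            else:
                move = random.choice(legal)
            env.step(move)
            side_to_move = -side_to_move
        if env.winner() == agent_side:
            wins += 1
    return wins / n_games
```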

A brief clip of the network learning.

Despite my best efforts, however, the network couldn't push its win-rate much past 80%. It also made some glaring mistakes, like leaving an opponent's three-in-a-row unblocked. My best guesses are that the network wasn't expressive enough to learn the game, or that it simply hadn't trained for long enough.

Quest: Othello

Connect 4 was cool, but there still wasn't enough room for the network to explore and innovate. Sticking with the board-game idea, my mind first went to chess. I discarded it, however, because I thought it was too complicated for my naïve bot to learn in a reasonable amount of time. The game I settled on was Othello. The rules were easy enough, but the game left a lot of room for mastery. Heck, the game's tagline was "A minute to learn, a lifetime to master."

Because Othello, like Connect 4, is symmetric between the two players, I wanted the same network to play both sides of the game. So once again, the environment was set up to make both players think they were playing to make white win.
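Concretely, I'd guess the trick looks something like this (the encoding and helper below are my own illustration, not the article's code): the board is multiplied by the side to move, so the current player's discs always appear as white's.

```python
import numpy as np

def to_mover_perspective(board, player):
    """Multiply by the side to move (+1 white, -1 black) so the current
    player's discs always show up as +1. This is my reading of the
    'both players think they are playing white' setup, not verified code."""
    return board * player

# Example: the standard Othello opening position, black to move.
board = np.zeros((8, 8), dtype=int)
board[3, 3] = board[4, 4] = 1    # white
board[3, 4] = board[4, 3] = -1   # black
obs = to_mover_perspective(board, player=-1)
```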

One of the tricky parts of Othello was deciding how to assign rewards. In chess and Connect 4, the player who makes the last move is generally the winner, but in Othello it doesn't really matter who plays the finishing move. What matters is how much of the board you control, so I originally thought the reward could simply be the number of pieces a move flips over.

But that wouldn't work on its own, because a move can set the opponent up for many more points. I decided that the reward should depend both on the pieces your move gains and on the pieces your opponent gains in reply.

A crudely-drawn illustration of how I decide the reward of a move.
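My reading of that scheme, as a tiny sketch (the exact formula is my interpretation of the diagram, not something stated in the article):

```python
def move_reward(my_flips, opponent_flips_in_reply):
    """Reward a move by the discs it flips, minus the discs the opponent
    flips with their reply, so a move that gifts a big capture scores poorly.
    (Interpretation of the diagram, not the article's exact formula.)"""
    return my_flips - opponent_flips_in_reply

# Flipping 3 discs but allowing the opponent to flip 5 back nets -2.
print(move_reward(3, 5))  # -2
```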

Initially, I applied the same Double DQN strategy to Othello, just with more convolutional layers (three) and more dense layers (five). But even after 12 hours of training there was no substantial improvement in win-rate. It was still hovering around 50%, which is pretty disappointing against an opponent that literally picks its moves at random.

So I researched ways to improve this. One of the first things I came across was Dueling DQNs (arXiv:1511.06581 [cs.LG]). In these, the Q-value is split into two parts: value and advantage. Value captures how good it is to be in a certain state, like your current score in Super Mario or the number of discs you hold in Othello. Advantage captures how much better a particular action is expected to be than the alternatives, like jumping or playing in a specific square. By splitting these up, the network should in theory be able to model the Q-function better, and therefore play better. I modified my network structure to include the dueling architecture in addition to the already-existing Double DQN, making it a (comical) DDDQN.
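As an illustration of the dueling head (the layer sizes are mine; only the split itself comes from the paper), the value and advantage streams are recombined with the advantage mean subtracted, as in arXiv:1511.06581:

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Sketch of the dueling split: one stream estimates the state value V(s),
    another the per-action advantage A(s, a), recombined as
    Q = V + (A - mean(A)). Layer sizes are illustrative."""
    def __init__(self, in_features, n_actions):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(in_features, 128), nn.ReLU(),
                                   nn.Linear(128, 1))
        self.advantage = nn.Sequential(nn.Linear(in_features, 128), nn.ReLU(),
                                       nn.Linear(128, n_actions))

    def forward(self, features):
        v = self.value(features)                       # (batch, 1)
        a = self.advantage(features)                   # (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)     # (batch, n_actions)
```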