An Intuitive Guide to Reinforcement Learning — Codealong Pt. 2

Source: Deep Learning on Medium

Go to the profile of Russell

Part 1:

This time, we will program our AI to explore the game states semi-randomly instead of recursively exploring every state. For a simple game like tic-tac-toe, this method actually ends up taking much longer to train to perfection than using recursion. However, for more complicated games where it just isn’t feasible to explore every game state, semi-random exploration allows your AI to prioritize optimal game paths without getting stuck on locally optimal paths. The code for this module can be found at, and we will be using the files TicTacToe.ipynb and

This time, our approach is to simulate semi-random games and have the AI update our Q matrix at each step. We create a function simulate_game() which takes the following steps:

  1. Create a new game board and initialize Q with random values between -.15 and .15.
  2. Call action() which returns a random move self.random_rate=20% of the time and returns the best move according to the current state of the Q matrix the other (1-self.random_rate)=20% of the time.
  3. Check if the move, xy, is valid according to the game rules. If the move is valid, proceed to step 4. If not, set self.Q[s][move] to +.5 or -.5 for player 0 and 1 respectively so the AI knows not to pick that move, and return to step 2.
  4. Update the board with xy, so the previous state s is now s_prev, and s is the new state.
  5. Call updateQ().
  6. Repeat Steps 2–6 until the game is done.

Our function updateQ() runs after each move in our simulated games and uses 2 of our parameters, self.learning_rate=.8 and self.decay_rate=.2 . If the previous move ended the game, Q[s_prev][move] is set equal to the score. Else, we will take either the Q value of the next player’s optimal move (if player 0 made the last move, player 1 will move next, so we take the max of Q[s] and vice versa), multiply that value by the decay rate, and call it expected. So self.Q[s_prev][move] becomes (1-self.learning_rate)*self.Q[s_prev][move] + self.learning_rate * expected.

So now, all that’s left to do is call simulate_game() repeatedly and pray that the Q matrix converges on the correct solution. I found that after 300,000 games the AI was meeting my standard of never losing a game.