Source: Deep Learning on Medium

This time, we will program our AI to explore the game states semi-randomly instead of recursively exploring every state. For a simple game like tic-tac-toe, this method actually ends up taking much longer to train to perfection than using recursion. However, for more complicated games where it just isn’t feasible to explore every game state, semi-random exploration allows your AI to prioritize optimal game paths without getting stuck on locally optimal paths. The code for this module can be found at https://github.com/rnbrown/Reinforcement-Learning/, and we will be using the files TicTacToe.ipynb and QLearning.py.

This time, our approach is to simulate semi-random games and have the AI update our Q matrix at each step. We create a function *simulate_game()*, which takes the following steps:

1. Create a new game board and initialize *Q* with random values between -.15 and .15.
2. Call *action()*, which returns a random move *self.random_rate* = 20% of the time and returns the best move according to the current state of the Q matrix the other (1 - *self.random_rate*) = 80% of the time.
3. Check whether the move, *xy*, is valid according to the game rules. If the move is valid, proceed to step 4. If not, set *self.Q[s][move]* to +.5 or -.5 for player 0 and 1 respectively so the AI knows not to pick that move, and return to step 2.
4. Update the board with *xy*, so the previous state *s* becomes *s_prev* and *s* is the new state.
5. Call *updateQ()*.
6. Repeat steps 2–5 until the game is done.
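The loop above can be sketched in Python. Everything here is an assumption for illustration — the board encoding, the flat 0–8 move indexing, and the helper names are stand-ins, not the actual implementation in QLearning.py:

```python
import random

RANDOM_RATE = 0.2  # probability of an exploratory (random) move

def new_Q_row():
    # Step 1: Q values for an unseen state start as small random numbers
    return {m: random.uniform(-0.15, 0.15) for m in range(9)}

def action(Q, s, player):
    # Step 2: explore random_rate of the time, otherwise exploit Q.
    # Player 0 minimizes and player 1 maximizes, matching the +.5/-.5
    # invalid-move penalties described above.
    if random.random() < RANDOM_RATE:
        return random.randrange(9)
    best = min if player == 0 else max
    return best(range(9), key=lambda m: Q[s][m])

def winner(board):
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]
    for a, b, c in lines:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

def simulate_game(Q):
    board = [None] * 9
    player = 0
    while True:
        s = tuple(board)
        if s not in Q:
            Q[s] = new_Q_row()
        move = action(Q, s, player)
        if board[move] is not None:
            # Step 3: invalid move -- mark it so this player avoids it
            Q[s][move] = 0.5 if player == 0 else -0.5
            continue
        board[move] = player           # Step 4: apply the move
        s_prev, s = s, tuple(board)    # previous state s becomes s_prev
        # Step 5: updateQ() would run here with s_prev, s, move, player
        w = winner(board)
        if w is not None or all(c is not None for c in board):
            return w                   # Step 6: stop when the game is done
        player = 1 - player
```

Marking an invalid move instead of raising an error lets the learner discover the rules from the same feedback channel it uses for wins and losses.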

Our function *updateQ()* runs after each move in our simulated games and uses two of our parameters, *self.learning_rate* = .8 and *self.decay_rate* = .2. If the previous move ended the game, *Q[s_prev][move]* is set equal to the score. Otherwise, we take the *Q* value of the next player's optimal move (if player 0 made the last move, player 1 will move next, so we take the max of *Q[s]*, and vice versa), multiply that value by the decay rate, and call it *expected*. So self.Q[s_prev][move] becomes (1 - self.learning_rate) * self.Q[s_prev][move] + self.learning_rate * expected.
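That update rule can be written out directly. This is a hedged sketch of the rule as described, not the repo's actual function — the argument names and the dict-of-dicts Q container are assumptions:

```python
LEARNING_RATE = 0.8
DECAY_RATE = 0.2

def updateQ(Q, s_prev, s, move, player, score, done):
    """Blend the old Q value toward the decayed value of the next
    player's optimal move; `player` is whoever just made `move`."""
    if done:
        # The move that ended the game gets the final score directly
        Q[s_prev][move] = score
        return
    # If player 0 just moved, player 1 (the maximizer) acts next, so we
    # take the max over Q[s]; otherwise player 0 (the minimizer) acts next
    opt = max(Q[s].values()) if player == 0 else min(Q[s].values())
    expected = DECAY_RATE * opt
    Q[s_prev][move] = ((1 - LEARNING_RATE) * Q[s_prev][move]
                       + LEARNING_RATE * expected)
```

With learning_rate = .8, each visit moves the old estimate 80% of the way toward the new target, so early random initialization washes out quickly.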

So now, all that’s left to do is call *simulate_game()* repeatedly and pray that the Q matrix converges on the correct solution. I found that after 300,000 games, the AI met my standard of never losing a game.
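The training driver itself is just a loop. A hypothetical sketch — the class name and method are illustrative stand-ins (the real logic lives in QLearning.py and is driven from TicTacToe.ipynb), with a do-nothing body so the snippet runs on its own:

```python
class QLearner:
    """Illustrative wrapper; not the repo's actual API."""

    def __init__(self):
        self.Q = {}
        self.games = 0

    def simulate_game(self):
        # stand-in body: the real method plays one semi-random game,
        # calling updateQ() after every move
        self.games += 1

learner = QLearner()
for _ in range(300_000):  # the author found ~300,000 games sufficed
    learner.simulate_game()
```

Because the Q matrix persists across calls, each game starts from everything learned so far; only the board is reset.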