Playing cards with Reinforcement Learning (1/3)

Enough chit chat, let’s do some real RL 🚀

If you understand better with code, you can go directly to the commented notebook 👇

I. The environment

First things first, we must define the environment in which our agent will learn which action (hit or stick) to pick in each state in order to beat the dealer.
The environment simply encodes the rules of Easy21. You can refer to the notebook if you want to see the details.
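
To give a rough idea, here is a minimal sketch of what such an environment could look like in Python, assuming the standard Easy21 rules (cards valued 1–10, black with probability 2/3 is added to the score, red with probability 1/3 is subtracted, the dealer sticks on 17 or more). The `Easy21Env` class and its method names are purely illustrative, the actual implementation lives in the notebook:

```python
import random

HIT, STICK = 0, 1

def draw_card():
    """Draw a card valued 1-10; black (prob 2/3) adds, red (prob 1/3) subtracts."""
    value = random.randint(1, 10)
    return value if random.random() < 2 / 3 else -value

class Easy21Env:
    def reset(self):
        """Start a game: player and dealer each draw one black card."""
        self.player = random.randint(1, 10)
        self.dealer = random.randint(1, 10)
        return (self.player, self.dealer)

    def step(self, action):
        """Apply hit/stick and return (state, reward, done)."""
        if action == HIT:
            self.player += draw_card()
            if self.player < 1 or self.player > 21:   # player goes bust
                return (self.player, self.dealer), -1, True
            return (self.player, self.dealer), 0, False
        # STICK: the dealer plays until reaching at least 17 (or going bust)
        while 1 <= self.dealer < 17:
            self.dealer += draw_card()
        if self.dealer < 1 or self.dealer > 21 or self.dealer < self.player:
            reward = 1      # dealer busts or scores lower: we win
        elif self.dealer == self.player:
            reward = 0      # draw
        else:
            reward = -1     # dealer wins
        return (self.player, self.dealer), reward, True
```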

That’s great! But now that we have an environment for our agent, how do we learn the optimal policy inside it?

II. The Monte-Carlo Control approach

Here, I will explain the Monte-Carlo Control concept in plain English only. However, you can check the notebook to dive into the details of the implementation.

Let’s first demystify these terms.

  • Monte-Carlo is a fancy name to say that we are going to sample episodes (Easy21 game sequences in our case).
  • Control means we are going to find the optimal policy, i.e. the best action to pick in any state to maximize our winning chances.

The Monte-Carlo Control approach samples an episode, i.e. plays a game of Easy21 using the “current” strategy, and looks at the reward of the terminal state.
Then, very simply, it updates the expected reward of each action-state pair encountered in the sampled episode.

The update, seen from inside the agent’s head: “if the game I just played was a win (resp. a loss), the values of the action-state pairs I took along the way should be increased (resp. decreased)”.
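
In code, that update is just an incremental mean pulled toward the episode’s return. A minimal sketch, assuming `Q` and `N` are dictionaries keyed by (state, action) pairs, `episode` is the list of pairs visited during the game, and `G` is the reward of the terminal state:

```python
from collections import defaultdict

Q = defaultdict(float)   # estimated value of each (state, action) pair
N = defaultdict(int)     # visit count of each (state, action) pair

def monte_carlo_update(episode, G):
    """Move Q(s, a) toward the observed return G for every pair visited."""
    for state, action in episode:
        N[(state, action)] += 1
        alpha = 1 / N[(state, action)]               # decaying step size
        Q[(state, action)] += alpha * (G - Q[(state, action)])
```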

Now that the Q value function has been updated, we can build a “new” strategy and sample the next episode with it.

Repeat this sampling/updating procedure until you reach a fixed number of episodes or until the new strategy stops changing from the current one… and there you have a nice piece of optimal Q value function 👨‍🍳
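
Put together, the whole loop only takes a few lines. Here is a hedged sketch reusing the `Easy21Env` and `monte_carlo_update` pieces above, with an ε-greedy strategy (ε shrinking as a state gets visited more often) playing the role of the “current”/“new” strategy:

```python
def epsilon_greedy(state, n0=100):
    """Explore with probability epsilon, otherwise pick the greedy action."""
    n_state = N[(state, HIT)] + N[(state, STICK)]
    epsilon = n0 / (n0 + n_state)        # explore less as the state is visited more
    if random.random() < epsilon:
        return random.choice([HIT, STICK])
    return max((HIT, STICK), key=lambda a: Q[(state, a)])

env = Easy21Env()
for _ in range(100_000):                 # number of sampled episodes
    state, done, episode = env.reset(), False, []
    while not done:
        action = epsilon_greedy(state)
        episode.append((state, action))
        state, reward, done = env.step(action)
    monte_carlo_update(episode, reward)  # terminal reward is the return of the episode
```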

Fig 2 shows the optimal Q value functions obtained when sampling 1 000, 100 000 or 1 000 000 episodes. We logically see that the more episodes we sample, the smoother the optimal Q value function becomes.

Fig 2. The optimal Q value functions for different numbers of sampled episodes. As expected, the more episodes sampled, the less variance in the optimal Q value function.
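
For reference, surfaces like those in Fig 2 can be produced by plotting V(s) = maxₐ Q(s, a) over the player-score × dealer-score grid. A hypothetical sketch using matplotlib and the `Q` dictionary above:

```python
import numpy as np
import matplotlib.pyplot as plt

players = np.arange(1, 22)    # player score 1..21
dealers = np.arange(1, 11)    # dealer showing card 1..10

# V(s) = max over actions of Q(s, a) for every state on the grid
V = np.array([[max(Q[((p, d), HIT)], Q[((p, d), STICK)]) for d in dealers]
              for p in players])

X, Y = np.meshgrid(dealers, players)
ax = plt.figure().add_subplot(projection="3d")
ax.plot_surface(X, Y, V)
ax.set_xlabel("Dealer showing"); ax.set_ylabel("Player sum"); ax.set_zlabel("V(s)")
plt.show()
```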

From this calculated optimal Q value function, the agent can now identify which action to pick in any given state to maximize his chances of ending in a winning terminal state. That’s the Control, my friends 👊
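
Reading the best action out of the learned Q value function is a simple argmax per state. A small sketch, again assuming the `Q` dictionary above:

```python
def greedy_policy(state):
    """Once Q has converged, the best action in a state is the argmax over Q."""
    return max((HIT, STICK), key=lambda a: Q[(state, a)])

# e.g. query the state from Fig 1: player score 13, dealer showing 9
action = greedy_policy((13, 9))   # hit or stick, whichever has the higher Q value
```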

Remember Fig 1: in the state {player score = 13, dealer score = 9}, the player stuck and lost… Now, our trained agent 🤖 knows he would have a higher chance of winning if he hit instead, as we can see in Fig 3.

Fig 3. 🏆 The optimal policy, aka the Holy Grail of RL 🏆

With the optimal policy above, our agent knows exactly which action to pick in any state to maximize his chances of beating the dealer 💥🥊

Et voilà !

Resources:
– David Silver’s RL course at UCL (youtube)
– This amazing GitHub repository gathering most of the well-known RL algorithms