Original article was published on Artificial Intelligence on Medium

## Quick Summary of Blackjack

Blackjack is a card game played against a dealer. At the start of a round, both player and dealer are dealt 2 cards. The player can only see one of the dealer’s cards. The goal of the game is to get the value of our cards as close to 21 as possible without going over 21. Number cards are worth their face value, face cards (jack, queen, king) are worth 10, and an ace is worth either 1 or 11.

If the player has less than 21, they can choose to “hit” and receive a random card from the deck. They can also choose to “stand” and keep the cards they have. If the player exceeds 21, they “bust” and automatically lose the round. If the player has exactly 21, they automatically win. Otherwise, the player wins if they are closer to 21 than the dealer.

*There are more granular rules to Blackjack; read them **here**.*

## Why read this post?

The purpose of this post is to show the dazzling power of even simple simulations in modeling the world around us. For any Reinforcement Learning endeavor, it is best to learn through simple games like Blackjack because they have only a few uncertainties to keep track of, clear reward and punishment rules, and a straightforward goal to aim for.

Casino games like Blackjack are designed to statistically favor the “house,” and I am interested in seeing if I can “crack” that through Reinforcement Learning algorithms. For Blackjack, my goal is to derive a player **policy** that actually yields an advantage over the casino/dealer in the long run.

## Some Key Terms

A policy, in this context, is a playbook that tells the player (or our **AI agent**) exactly what decision to make in each possible situation. An AI agent is the abstract being that will be learning and updating its knowledge through our Reinforcement Learning algorithm.

In our simplified model of Blackjack, we will only give the player two options: hit or stand. We will start our player with $1000, and assume they bet $100 each round. Here are the rewards/punishments for win/lose/tie:

- Win: +$100
- Tie: $0
- Lose: -$100
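As a rough illustration, the payout scheme above can be written as a small lookup table. The names `BET`, `REWARDS`, and `update_bankroll` are hypothetical and chosen only for this sketch; they are not taken from the original notebook.

```python
# Hypothetical names; the payouts mirror the win/tie/lose list above.
BET = 100
REWARDS = {"win": +BET, "tie": 0, "lose": -BET}


def update_bankroll(bankroll, outcome):
    """Return the bankroll after one round with the given outcome."""
    return bankroll + REWARDS[outcome]


bankroll = 1000  # starting money from the simplified model
for outcome in ["win", "lose", "lose", "tie"]:
    bankroll = update_bankroll(bankroll, outcome)
print(bankroll)  # 900
```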

## Building Blackjack

I suggest going through the code below at your own pace after reading this post. Skim through the code for now just to get some ideas about how to build a simulation environment that best suits your needs. Check out the full notebook here.

## A TL;DR of the code above:

- Defined a Card and a Deck using OOP principles (classes, functions, etc) and basic data structures (lists, dictionaries, etc).
- Defined 2 functions that implement the logic behind a dealer’s turn and evaluating a dealer’s hand. There is a fixed, predictable logic for this defined here.
- Defined a function to implement the logic behind evaluating the player’s hand. There are different ways to approach this.
- Defined a master `play_game()` function that simulates games of Blackjack given parameters such as player policy, number of rounds, etc.
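To make the building blocks in this list concrete, here is a minimal sketch of a card, a deck, hand evaluation, and a dealer's turn. It is not the notebook's code. The class and function names are illustrative, and the rule that the dealer draws until reaching 17 is the common casino convention, assumed here rather than taken from the original implementation.

```python
import random
from dataclasses import dataclass

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]


@dataclass(frozen=True)
class Card:
    rank: str
    suit: str

    @property
    def value(self):
        """Face cards count 10; an ace counts 11 and is softened in hand_value."""
        if self.rank == "A":
            return 11
        if self.rank in ("J", "Q", "K"):
            return 10
        return int(self.rank)


class Deck:
    """A single shuffled 52-card deck (illustrative; no reshuffling logic)."""

    def __init__(self, rng=None):
        self.rng = rng if rng is not None else random.Random()
        self.cards = [Card(rank, suit) for suit in SUITS for rank in RANKS]
        self.rng.shuffle(self.cards)

    def draw(self):
        return self.cards.pop()


def hand_value(hand):
    """Best total for a hand: count aces as 1 instead of 11 while the hand would bust."""
    total = sum(card.value for card in hand)
    aces = sum(card.rank == "A" for card in hand)
    while total > 21 and aces:
        total -= 10
        aces -= 1
    return total


def dealer_turn(dealer_hand, deck):
    """Assumed house rule: the dealer draws until the hand totals 17 or more."""
    while hand_value(dealer_hand) < 17:
        dealer_hand.append(deck.draw())
    return hand_value(dealer_hand)


print(hand_value([Card("A", "spades"), Card("K", "hearts")]))  # 21
print(hand_value([Card("A", "spades"), Card("9", "hearts"), Card("5", "clubs")]))  # 15
```

Counting an ace as 11 first and demoting it only when the hand would bust keeps the evaluation logic in one place, which is one of several reasonable ways to handle aces.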

The focus of the rest of this article will be playing with the **player policy** input to this simulation environment to see what insights we can gain.

## Simple Monte Carlo Simulations

The crux of any Monte Carlo approach:

“Using randomness to solve problems that might be deterministic in principle.” — Wikipedia

More simply, we will rely on the compute power of machines to approximate certain conclusions instead of precisely deriving those results through pencil and paper.

## Quick (non-Blackjack) Example of the Monte Carlo Method

For example, let’s use a Monte Carlo approach to answer this simple question:

*“If I roll two 6-sided dice, what sum of the two dice is most likely?”*

As taught in every statistics class, we can solve this using pencil and paper by creating a probability distribution of the possible sums, and seeing which has the largest probability of occurring.
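For reference, the pencil-and-paper answer can also be checked by brute force. This short sketch enumerates all 36 equally likely outcomes of two fair dice and computes the exact probability of each sum; the variable names are chosen only for this illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 equally likely (die1, die2) pairs produce each sum.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
distribution = {total: Fraction(n, 36) for total, n in sorted(counts.items())}

most_likely = max(distribution, key=distribution.get)
print(most_likely, distribution[most_likely])  # 7 1/6
```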

However, if we are too lazy to write that all out, we can code up a Monte Carlo simulation instead!

In the code below, I roll two dice 100,000 times and plot the distribution of the sums I get from the 100,000 simulations.
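The original code cell is not reproduced here, so the following is a minimal stand-in under stated assumptions: it uses only the standard library, fixes a seed so the run is repeatable, and prints a text histogram instead of drawing a chart. The function name `simulate_dice_sums` is hypothetical.

```python
import random
from collections import Counter


def simulate_dice_sums(n_rolls=100_000, seed=0):
    """Roll two fair six-sided dice n_rolls times and count each sum."""
    rng = random.Random(seed)
    return Counter(rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n_rolls))


counts = simulate_dice_sums()
n_total = sum(counts.values())

# Text histogram: one '#' per percentage point of observed frequency.
for total in range(2, 13):
    share = counts[total] / n_total
    print(f"{total:>2} {share:6.3f} {'#' * round(share * 100)}")
```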

Voila! We clearly see from the distribution chart that 7 is the most likely sum when rolling 2 dice. The magic of compute power!

## Back to Blackjack

The dice example above is a bit contrived. Monte Carlo methods are best suited to problems that would take a human a very long time to solve by hand.

Here is a more appropriate question for Monte Carlo simulations:

*“In Blackjack, what is the expected return per round if we stand when our hand ≥ 18, and hit otherwise?”*

Answering this by hand would mean working through every possible sequence of player and dealer draws, which is impractical for a human.

To answer this question, we will define our **player policy** as follows:

**Discrete Policy:**

- If hand ≥ 18: stand.
- Else: hit.

## Run the Simulation

Next, we will leverage compute power and the randomized nature of our Blackjack environment to run 100,000 rounds and calculate expected return.
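A compact sketch of such a run is shown below. It is not the notebook's environment, and it rests on several assumptions: cards are drawn from an infinite deck, the dealer draws until reaching 17, and a player total of exactly 21 wins outright, as in the simplified rules summarized earlier. Because of these choices, the number it prints may not match the figure reported in this post.

```python
import random

# Assumption: an infinite deck, so every draw is independent of earlier draws.
# Values for 2-9, four ten-valued ranks (10, J, Q, K), and an ace counted as 11.
CARD_VALUES = [2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 11]


def draw(rng):
    return rng.choice(CARD_VALUES)


def hand_value(cards):
    """Hand total, counting aces as 1 instead of 11 when needed to avoid a bust."""
    total, aces = sum(cards), cards.count(11)
    while total > 21 and aces:
        total -= 10
        aces -= 1
    return total


def discrete_policy(total):
    """Stand on 18 or more, otherwise hit."""
    return "stand" if total >= 18 else "hit"


def play_round(policy, rng, bet=100):
    """Play one round and return the player's winnings in dollars."""
    player = [draw(rng), draw(rng)]
    dealer = [draw(rng), draw(rng)]
    while hand_value(player) < 21 and policy(hand_value(player)) == "hit":
        player.append(draw(rng))
    player_total = hand_value(player)
    if player_total > 21:
        return -bet  # player busts
    if player_total == 21:
        return bet  # simplified rule: exactly 21 wins automatically
    while hand_value(dealer) < 17:  # assumed dealer rule
        dealer.append(draw(rng))
    dealer_total = hand_value(dealer)
    if dealer_total > 21 or player_total > dealer_total:
        return bet
    if player_total == dealer_total:
        return 0
    return -bet


def play_game(policy, rounds=100_000, seed=0):
    """Average winnings per round over many simulated rounds."""
    rng = random.Random(seed)
    return sum(play_round(policy, rng) for _ in range(rounds)) / rounds


print(f"Average return per round: ${play_game(discrete_policy):.2f}")
```

Passing the policy in as a function keeps the simulation loop unchanged when a different playbook is tried.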

The results show that, on average, a player would **lose $10** every round if they decided to use the Discrete Policy. We need to find a policy that yields better returns!

## The Stochastic Policy

The Discrete Policy is considered “discrete” because as soon as a condition is met (hand ≥ 18, for example), there is only one possible action that our player will take. In Reinforcement Learning terminology, this is usually called a deterministic policy.

When we design our Reinforcement Learning AI agent, however, we will want it to have the chance to explore other actions even if some criterion tells it to do something else. It is important for our AI agent to keep exploring in the early stages of the learning process because its policy for picking actions has been determined by too small a sample.

Therefore, we need to get used to **stochastic** variables and policies that allow room for randomness when picking an action. Here is one we can try:

**Stochastic Policy:**

- If hand ≥ 18: 80% stand / 20% hit.
- Else: 80% hit / 20% stand.
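One way to express this policy in code is to follow the threshold rule with probability 0.8 and take the opposite action otherwise. The function name `stochastic_policy` and the `p_follow` parameter are assumptions made for this sketch. The check at the end only confirms that the observed share of "stand" decisions at a total of 19 lands near 80%.

```python
import random


def stochastic_policy(total, rng, p_follow=0.8):
    """Follow the threshold rule with probability p_follow, else do the opposite."""
    preferred = "stand" if total >= 18 else "hit"
    other = "hit" if preferred == "stand" else "stand"
    return preferred if rng.random() < p_follow else other


rng = random.Random(0)
n = 10_000
stand_share = sum(stochastic_policy(19, rng) == "stand" for _ in range(n)) / n
print(f"Share of 'stand' at a total of 19: {stand_share:.2f}")
```

Setting `p_follow=1.0` recovers the Discrete Policy, which makes the two policies easy to compare inside the same simulation loop.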

## Run the Simulation

The results show that, on average, a player would **lose $18** every round if they decided to use the Stochastic Policy, which is **$8 more** than they would lose with the Discrete Policy! Despite all the emphasis I put on the importance of stochastic policies in Reinforcement Learning, the Discrete Policy still won!

Using our human intuition, we can understand why this happened. It is always better in the long run to keep hitting when hand < 18 and keep standing when hand ≥ 18. It is detrimental in the long run to add a 20% chance of doing a counter-intuitive action in either of these states.

However, our AI agent will not be born with this intuition. Stochastic policies are fluid, and can be tweaked on the fly. This approach allows our AI agent to suffer through countless rounds of hitting when hand ≥ 18, and incrementally tweak its policy until it stands 100% of the time when hand ≥ 18.