Build your First AI game bot using OpenAI Gym, Keras, TensorFlow in Python



This post explains OpenAI Gym and shows you how to apply deep learning to play the CartPole game.

Whenever I heard stories about Google DeepMind’s AlphaGo, I used to wish I could build something like that, at least on a small scale. I think God listened to my wish and showed me the way 😃.

Recently I got to know about OpenAI Gym and reinforcement learning. I started reading about them and loved it. I am not building this game bot with reinforcement learning for now, but I definitely plan to in the future.

Building a gaming bot means it needs to do what we do when we play a game. So what do we usually do when we don’t have a clue about a game?

We ask our friends or read the instructions to find out which keys we need to press while playing, right?

Then we start playing, observe the game, and press a button (up, down, right, back, or something else) according to the current stage or state.

It means our gaming bot should know the current state of the game and, based on that state, press a button; that is, perform a certain action. If it is a good move, we get good results (positive rewards); if not, we get bad results (negative rewards). The bot should keep choosing actions according to the current state until we win or lose, that is, until the game is done.

You might be wondering how the bot gets to know the current state of the game environment the way we do. We observe the game through our eyes, so should we also implement computer vision to train our bot?

This is where OpenAI Gym helps us. It gives us the current state of the game (the environment), and it gives us a handle to perform the action we want based on that state.

I am assuming you have Keras, TensorFlow, and Python on your system; if not, please read this article first. We also need to install OpenAI Gym. To install it, run the command below.

# If you are using Python 2, use 'pip install gym' instead
pip3 install gym

Whenever we learn a new language, we usually start with a Hello World program. In the same way, whoever starts learning OpenAI Gym starts with the CartPole game. In this article, we will concentrate on this game.

OpenAI Gym gives us all the information about a game and its current state. It also gives us a handle to perform the actions we want, so we can keep playing until the game is done.

Before writing the code, let’s go over some vocabulary we are going to use with OpenAI Gym.

environment:- An object or interface through which we or our game bot (the agent) can interact with the game and get details of its current state. There are several different games, or environments, available. You can find them here
step:- A function through which we perform an action at the current state of the game.
action:- A value or object describing what we want to do at the current state of the game, such as moving right, moving left, or jumping.
observation (object):- An environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.
reward (float):- Amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.
done (boolean):- whether it’s time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)
info (dict):- diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change). However, official evaluations of your agent are not allowed to use this for learning.


Let’s understand OpenAI Gym by writing some code for CartPole.
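The original listing is not included in this copy of the post, so here is a minimal sketch of what such a script could look like. It is a reconstruction under assumptions, not the author's exact code: it wraps the loop in a hypothetical helper called `run_random_episode` (not part of Gym), and it assumes the classic Gym API (before version 0.26), where `env.step` returns four values. Because of the wrapper, the line numbers in the walkthrough below will not match this sketch exactly, but the steps appear in the same order.

```python
def run_random_episode(env, max_steps=1000, render=False):
    """Play one episode with random actions on a Gym-style environment.

    Hypothetical helper, not part of Gym. Assumes the classic API in which
    step() returns (observation, reward, done, info).
    """
    env.reset()
    history = []
    for step_index in range(max_steps):
        if render:
            env.render()  # draw the game so we can watch what happens
        action = env.action_space.sample()  # a random but valid action
        observation, reward, done, info = env.step(action)
        print("Step {}:".format(step_index))
        print("action: {}".format(action))
        print("observation: {}".format(observation))
        print("reward: {}".format(reward))
        print("done: {}".format(done))
        print("info: {}".format(info))
        history.append((action, observation, reward, done, info))
        if done:
            break  # the episode is over, so stop taking actions
    return history
```

With Gym installed, this could be called as `run_random_episode(gym.make('CartPole-v1'), render=True)` after `import gym`. Newer Gym and Gymnasium releases return five values from `step` (with `terminated` and `truncated` in place of `done`), so the unpacking line would need adjusting there.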

Let’s understand the above code line by line.
1. Import the gym package.
2. Create the ‘CartPole’ environment.
3. Reset the environment.
4. Run a loop to take several actions and play the game. For now, let’s play as long as we can, which is why we try to play up to 1000 steps at most.
5. env.render(): This renders the game, so that we can see what actually happens when we take a step/action.
6. Get a random action to take. Here “env.action_space.sample()” gives us a random action that is allowed in this game.
7. Perform that random action through the step function. This returns observation, reward, done, and info.
8–13. Print them to see what we did, what exactly happened, what reward we got, and whether the game is completed or not.
14–15. If the game is completed/done, stop taking the next step/action.

Finally, this code will give us output like this.

Step 0:
action: 0
observation: [-0.0120364 -0.21794704 -0.00759187 0.26062941]
reward: 1.0
done: False
info: {}
Step 1:
action: 1
observation: [-0.01639534 -0.02271754 -0.00237928 -0.03443839]
reward: 1.0
done: False
info: {}
Step 2:
action: 1
observation: [-0.0168497 0.17243845 -0.00306805 -0.32787105]
reward: 1.0
done: False
info: {}
Step 3:
action: 0
observation: [-0.01340093 -0.02263969 -0.00962547 -0.03615722]
reward: 1.0
done: False
info: {}
Step 4:
action: 1
observation: [-0.01385372 0.17261896 -0.01034861 -0.33186148]
reward: 1.0
done: False
info: {}
Step 5:
action: 1
observation: [-0.01040134 0.36788668 -0.01698584 -0.6277898 ]
reward: 1.0
done: False
info: {}
Step 6:
action: 1
observation: [-0.00304361 0.56324153 -0.02954164 -0.9257734 ]
reward: 1.0
done: False
info: {}
Step 7:
action: 1
observation: [ 0.00822122 0.75874972 -0.04805711 -1.22759172]
reward: 1.0
done: False
info: {}
Step 8:
action: 1
observation: [ 0.02339622 0.95445615 -0.07260894 -1.53493579]
reward: 1.0
done: False
info: {}
Step 9:
action: 1
observation: [ 0.04248534 1.15037357 -0.10330766 -1.84936587]
reward: 1.0
done: False
info: {}
Step 10:
action: 1
observation: [ 0.06549281 1.3464699 -0.14029497 -2.17226059]
reward: 1.0
done: False
info: {}
Step 11:
action: 0
observation: [ 0.09242221 1.15296679 -0.18374019 -1.92596929]
reward: 1.0
done: False
info: {}
Step 12:
action: 0
observation: [ 0.11548155 0.96023062 -0.22225957 -1.69544763]
reward: 1.0
done: True
info: {}

This gives us a fair idea of what’s happening. After every step we get a reward of 1.0 until the game ends. The observation also gives us the current position of the cart and the pole. Here action 0 means move the cart to the left and 1 means move the cart to the right. For more details visit its wiki page.

Now that we have an understanding of OpenAI Gym, it’s time to build our model, which means we need some data to train on. So let’s play some games with random actions and collect the data from the games we played better.
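To make the four numbers in each observation easier to read, here is a small sketch that labels them. The field names and the helper `describe_observation` are my own labels, not Gym identifiers. As far as I understand the CartPole documentation, the values are cart position, cart velocity, pole angle (in radians), and pole angular velocity, and an episode ends once the pole tilts more than about 12 degrees from vertical.

```python
import math

# Labels chosen for this sketch; Gym itself returns a plain array.
OBSERVATION_FIELDS = (
    "cart_position",
    "cart_velocity",
    "pole_angle",
    "pole_angular_velocity",
)
ACTION_NAMES = {0: "push cart left", 1: "push cart right"}


def describe_observation(observation):
    """Attach a name to each of the four CartPole observation values."""
    labelled = dict(zip(OBSERVATION_FIELDS, observation))
    labelled["pole_angle_degrees"] = math.degrees(labelled["pole_angle"])
    return labelled


# Observation from step 12 of the output above, where done became True.
last = describe_observation([0.11548155, 0.96023062, -0.22225957, -1.69544763])
print(ACTION_NAMES[0], last)
# True here means the pole has tilted past the 12 degree limit.
print(abs(last["pole_angle_degrees"]) > 12)
```

This is why the run above stopped at step 12: the pole angle had grown to roughly 12.7 degrees.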

First, let’s import the packages required to do our job.

import gym
import random
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

Then let’s initialise some variables which are required later.

env = gym.make('CartPole-v1')
env.reset()
goal_steps = 500
score_requirement = 60
intial_games = 10000

We will use the code below to populate the data we need for training our deep learning model.

Let’s understand what we are doing in the above model_data_preparation function.
1. We initialize the training_data and accepted_scores arrays.
2. We need to play many times to collect enough data, so we play 10000 games. This line does that: “for game_index in range(intial_games):”
3. We initialize the score, game_memory, and previous_observation variables, which store the current game’s total score, the previous step’s observation (the position of the cart and pole), and the action we took for it.
4. for step_index in range(goal_steps): This loop plays the game for up to 500 steps, because reaching 500 steps means successfully completing the game.
5. We take random actions, which may lead to successfully completing the step or to losing the game. Only 2 actions are allowed here, moving left (0) or right (1), so this code (random.randrange(0, 2)) picks one of them at random.
6. We take that action. If it is not the first step, we store the previous observation together with the action we took for it. Then we add the reward to the score and check whether the game is completed; if it is, we stop playing.
7. We check whether this game fulfils our minimum requirement, that is, whether we were able to play at least 60 steps.
8. If so, we add this score to accepted_scores, which we print later to see how many games, and with what scores, we are feeding to our model.
9. Then we one-hot encode the action, because its values 0 (moving left) and 1 (moving right) represent categorical data.
10. Then we add that to our training_data.
11. We reset the environment to make sure everything is clear before starting the next game.
12. print(accepted_scores): This shows how many games, and with what scores, we are feeding to our model. Then we return the training data.
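The listing for model_data_preparation is not included in this copy of the post, so the steps above can be sketched as follows. This is a reconstruction from the description, not the author's exact code. As assumptions of mine, the environment and the settings are passed in as parameters (the original appears to rely on the variables initialised earlier), the parameter is spelled `initial_games`, and the classic Gym API with a four-value `step` is used. It also expects `env.reset()` to have been called once beforehand, as in the setup above.

```python
import random


def model_data_preparation(env, initial_games=10000, goal_steps=500,
                           score_requirement=60):
    """Collect [observation, one-hot action] pairs from good random games.

    Sketch only. Assumes step() returns (observation, reward, done, info).
    """
    training_data = []
    accepted_scores = []
    for game_index in range(initial_games):
        score = 0
        game_memory = []
        previous_observation = []
        for step_index in range(goal_steps):
            action = random.randrange(0, 2)  # 0 = left, 1 = right
            observation, reward, done, info = env.step(action)
            # Pair each action with the observation that preceded it.
            if len(previous_observation) > 0:
                game_memory.append([previous_observation, action])
            previous_observation = observation
            score += reward
            if done:
                break
        if score >= score_requirement:
            accepted_scores.append(score)
            for observation, action in game_memory:
                # One-hot encode: left -> [1, 0], right -> [0, 1].
                output = [0, 1] if action == 1 else [1, 0]
                training_data.append([observation, output])
        env.reset()  # start the next game from a fresh state
    print(accepted_scores)
    return training_data
```

With the variables defined earlier, it could be called as `training_data = model_data_preparation(env, intial_games, goal_steps, score_requirement)`.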

After execution, this code prints accepted_scores.

[63.0, 62.0, 74.0, 66.0, 84.0, 69.0, 65.0, 64.0, 66.0, 63.0, 62.0, 67.0, 62.0, 60.0, 76.0, 65.0, 87.0, 85.0, 76.0, 81.0, 68.0, 63.0, 80.0, 65.0, 63.0, 60.0, 60.0, 61.0, 86.0, 71.0, 72.0, 60.0, 95.0, 65.0, 68.0, 68.0, 63.0, 95.0, 91.0, 99.0, 86.0, 68.0, 72.0, 69.0, 62.0, 74.0, 76.0, 74.0, 64.0, 77.0, 92.0, 67.0, 67.0, 99.0, 81.0, 81.0, 63.0, 73.0, 70.0, 68.0, 63.0, 77.0, 61.0, 62.0, 78.0, 61.0, 71.0, 77.0, 70.0, 72.0, 80.0, 61.0, 68.0, 61.0, 86.0, 145.0, 74.0, 68.0, 79.0, 61.0, 63.0, 65.0, 62.0, 64.0, 65.0, 80.0, 67.0, 78.0, 76.0, 66.0, 63.0, 110.0, 62.0, 70.0, 72.0, 109.0, 76.0, 75.0, 75.0, 73.0, 75.0, 65.0, 77.0, 64.0, 61.0, 60.0, 66.0, 61.0, 62.0, 71.0, 75.0, 82.0, 95.0, 67.0, 61.0, 66.0, 67.0, 65.0, 61.0, 65.0, 66.0, 62.0, 70.0, 89.0, 96.0, 86.0, 62.0, 61.0, 75.0, 84.0, 63.0, 66.0, 73.0, 68.0, 61.0, 66.0, 144.0, 64.0, 61.0, 62.0, 62.0, 67.0, 66.0, 65.0, 66.0, 71.0, 68.0, 81.0, 73.0, 75.0, 75.0, 79.0, 75.0, 104.0, 69.0, 66.0, 81.0, 73.0, 60.0, 64.0, 78.0, 115.0, 62.0, 91.0, 70.0, 69.0, 64.0, 86.0, 70.0, 70.0, 68.0]

So our data is ready. It’s time to build our neural network.

def build_model(input_size, output_size):
    # Small fully connected network: observation in, one score per action out.
    model = Sequential()
    model.add(Dense(128, input_dim=input_size, activation='relu'))
    model.add(Dense(52, activation='relu'))
    model.add(Dense(output_size, activation='linear'))
    model.compile(loss='mse', optimizer=Adam())
    return model

Here we are going to use the sequential model.

def train_model(training_data):
    # Features: the observations. Labels: the one-hot encoded actions.
    X = np.array([i[0] for i in training_data]).reshape(-1, len(training_data[0][0]))
    y = np.array([i[1] for i in training_data]).reshape(-1, len(training_data[0][1]))
    model = build_model(input_size=len(X[0]), output_size=len(y[0]))

    model.fit(X, y, epochs=10)
    return model

We have the training data, so from that we create the features (the observations) and the labels (the one-hot encoded actions).
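To make the shapes concrete, here is a small check that applies the same reshape as train_model to two made-up samples. The values in `sample_data` are invented for illustration and are not real game data.

```python
import numpy as np

# Two invented samples in the [observation, one-hot action] layout.
sample_data = [
    [[0.01, -0.02, 0.03, 0.04], [1, 0]],   # the bot moved left
    [[0.02, 0.17, -0.01, -0.30], [0, 1]],  # the bot moved right
]

# Same reshaping as in train_model: one row per sample.
X = np.array([i[0] for i in sample_data]).reshape(-1, len(sample_data[0][0]))
y = np.array([i[1] for i in sample_data]).reshape(-1, len(sample_data[0][1]))

print(X.shape, y.shape)  # prints (2, 4) (2, 2)
```

So each row of X holds the four observation values, and each row of y holds the two-value encoding of the action taken.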

Then we will start the training.

trained_model = train_model(training_data)

The output will look like this

Epoch 1/10
12236/12236 [==============================] - 1s 94us/step - loss: 0.2483
Epoch 2/10
12236/12236 [==============================] - 1s 71us/step - loss: 0.2348
Epoch 3/10
12236/12236 [==============================] - 1s 67us/step - loss: 0.2333
Epoch 4/10
12236/12236 [==============================] - 1s 68us/step - loss: 0.2334
Epoch 5/10
12236/12236 [==============================] - 1s 64us/step - loss: 0.2325
Epoch 6/10
12236/12236 [==============================] - 1s 63us/step - loss: 0.2324
Epoch 7/10
12236/12236 [==============================] - 1s 66us/step - loss: 0.2315
Epoch 8/10
12236/12236 [==============================] - 1s 65us/step - loss: 0.2318
Epoch 9/10
12236/12236 [==============================] - 1s 65us/step - loss: 0.2317
Epoch 10/10
12236/12236 [==============================] - 1s 65us/step - loss: 0.2318
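For context on these loss values: with one-hot labels and the mse loss, a model that always outputs 0.5 for both actions would score 0.25. A quick check of that arithmetic, using a hypothetical `mse` helper written for this sketch (it averages over the two outputs, which is how I understand Keras computes this loss):

```python
def mse(prediction, target):
    """Mean squared error, averaged over the output values."""
    squared_errors = [(p - t) ** 2 for p, t in zip(prediction, target)]
    return sum(squared_errors) / len(squared_errors)


# A model that cannot tell the actions apart and always answers 0.5 / 0.5.
baseline = mse([0.5, 0.5], [1, 0])
print(baseline)  # prints 0.25
```

So a loss of about 0.23 is only a little below that baseline. This may simply reflect that the labels come from random play, so do not expect the loss to get close to zero.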

It’s time for our gaming bot to play the game for us.

Let’s understand this code.
1. We initialize the scores and choices arrays, which store the scores we got and the choices we made.
2. We play the game 100 times.
3. We initialize the score and prev_obs variables to store the current score and the previous observation.
4. We play the game for up to 500 steps; that is why we have this loop (for step_index in range(goal_steps):).
5. For the first step, we don’t know the state yet, so we take a random step.
6. From the second step onwards we know the current state of the game, so we give that observation to our model to predict which action to take. This part of the code (trained_model.predict(prev_obs.reshape(-1, len(prev_obs)))) gives us a score for each action; because the output layer is linear, these are not true probabilities. We pick the action with the highest score.
7. We store the choice we made, store the current state, and add the reward to our score.
8. After finishing a game, we reset the environment to play the next one and store the game’s final score to print.
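The listing for this part is also not included in this copy of the post, so the steps above can be sketched as follows. Again this is a reconstruction from the description, not the author's exact code. The helper name `play_games` and its parameters are my own, `trained_model` is assumed to be any object with a Keras-style `predict` method, and the classic Gym API with a four-value `step` is assumed.

```python
import random

import numpy as np


def play_games(env, trained_model, games=100, goal_steps=500):
    """Let the model play and return (scores, choices). Hypothetical helper.

    Assumes step() returns (observation, reward, done, info).
    """
    scores = []
    choices = []
    for each_game in range(games):
        score = 0
        prev_obs = []
        for step_index in range(goal_steps):
            if len(prev_obs) == 0:
                action = random.randrange(0, 2)  # no observation yet
            else:
                obs = np.asarray(prev_obs)
                prediction = trained_model.predict(obs.reshape(-1, len(obs)))
                action = int(np.argmax(prediction[0]))  # highest score wins
            choices.append(action)
            new_observation, reward, done, info = env.step(action)
            prev_obs = new_observation
            score += reward
            if done:
                break
        env.reset()  # fresh state for the next game
        scores.append(score)
    print(scores)
    print("Average Score:", sum(scores) / len(scores))
    print("choice 1:{} choice 0:{}".format(choices.count(1) / len(choices),
                                           choices.count(0) / len(choices)))
    return scores, choices
```

With the objects from the earlier steps, it could be called as `scores, choices = play_games(env, trained_model)`.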

We will get output like this

[500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 247.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 259.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 264.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 241.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 255.0, 500.0, 500.0, 500.0, 245.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0, 500.0]
Average Score: 485.11
choice 1:0.5007936344334275 choice 0:0.49920636556657255

Finally, our bot is trained, and it plays like a pro.


You will find the Jupyter notebook for this implementation here.


If you enjoyed this article, show me your love by giving it some claps 👏.
Peace. Happy Coding.

Source: Deep Learning on Medium