Original article was published by Marcos Carlomagno on Deep Learning on Medium

# Introduction

This week I spent a lot of time researching how to implement a reinforcement learning algorithm to train the agent for the game. After reading some articles, forums and documentation, I confirmed that what I need to implement is the Deep Q-learning algorithm.

## Deep Q learning pseudocode

Based on the article *Playing Atari with Deep Reinforcement Learning*, the algorithm looks like the following:
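The pseudocode figure from the original post is not reproduced here; paraphrasing the algorithm from that paper (experience replay plus an epsilon-greedy Q-network), it goes roughly like this:

```
Initialize replay memory D with capacity N
Initialize action-value network Q with random weights
for episode = 1..M:
    observe initial state s
    for t = 1..T:
        with probability ε select a random action a
        otherwise select a = argmax_a Q(s, a)
        execute a, observe reward r and next state s'
        store transition (s, a, r, s', done) in D
        sample a random minibatch of transitions from D
        for each transition, set the target:
            y = r                          if the episode ended
            y = r + γ · max_a' Q(s', a')   otherwise
        perform a gradient step on (y − Q(s, a))²
        s ← s'
```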

## Python implementation

I thought about implementing it from scratch, but then I remembered that GitHub exists… and, after searching for a while, I found a Python implementation that trains an agent in the OpenAI Gym CartPole environment.

# JavaScript implementation

After finding this code on GitHub, I decided to try to migrate it to JavaScript and integrate it into my environment with the same hyperparameters.

## Hyperparameters
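The original snippet with the hyperparameter values is not shown here. As a sketch, these are the usual defaults from the CartPole DQN implementations (so treat every value except the batch size of 32, which is mentioned later in the article, as an assumption):

```javascript
// Hyperparameters — values are assumptions based on common CartPole DQN
// defaults; only the batch size of 32 is stated in the article.
const hyperparams = {
  gamma: 0.95,         // discount factor for future rewards
  epsilon: 1.0,        // initial exploration rate
  epsilonMin: 0.01,    // floor for the exploration rate
  epsilonDecay: 0.995, // multiplicative decay applied after each replay
  learningRate: 0.001, // optimizer step size
  batchSize: 32,       // minibatch size sampled from memory
  memorySize: 2000,    // maximum transitions kept in replay memory
};
```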

## Model Building

The neural network has an input of 6 units that represent the 6 captured parameters of each state.

```
state.subzeroLife
state.kanoLife
state.subzeroPosition.x
state.subzeroPosition.y
state.kanoPosition.x
state.kanoPosition.y
```

Then it has two hidden layers of 24 units each and an output layer of 9 units, where each output unit represents one of the possible actions the agent can perform.

```
0: RIGHT
1: LEFT
2: UP
3: DOWN
4: BLOCK
5: HP
6: LP
7: LK
8: HK
```

The unit with the highest scalar value represents the action to be executed.
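The model-building snippet itself is not reproduced above (in practice a library such as TensorFlow.js would build it), but the 6 → 24 → 24 → 9 shape and the argmax selection can be sketched with a plain, dependency-free forward pass:

```javascript
// Illustrative sketch of the 6 -> 24 -> 24 -> 9 network shape; the real
// model would be built and trained with a library such as TensorFlow.js.
const relu = (x) => Math.max(0, x);

// A dense layer: weights[out][in] plus one bias per output unit.
function dense(input, weights, biases, activation = (x) => x) {
  return weights.map((row, i) =>
    activation(row.reduce((sum, w, j) => sum + w * input[j], biases[i]))
  );
}

// Random initialization, just to make the sketch runnable.
const randMatrix = (rows, cols) =>
  Array.from({ length: rows }, () =>
    Array.from({ length: cols }, () => Math.random() * 0.2 - 0.1)
  );
const zeros = (n) => new Array(n).fill(0);

const w1 = randMatrix(24, 6), b1 = zeros(24);  // input: 6 state parameters
const w2 = randMatrix(24, 24), b2 = zeros(24); // second hidden layer of 24
const w3 = randMatrix(9, 24), b3 = zeros(9);   // output: 9 possible actions

function predict(state) {
  const h1 = dense(state, w1, b1, relu);
  const h2 = dense(h1, w2, b2, relu);
  return dense(h2, w3, b3); // linear Q-values, one per action
}

// The unit with the highest value is the action to execute.
const argmax = (qs) => qs.indexOf(Math.max(...qs));
```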

## Update Target model

This function simply takes the weights of the model used for the fight and stores them in an “accumulator” model of training weights that is updated after each fight.
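The snippet is not reproduced above; treating the weights as plain nested arrays (a hypothetical layout, not the library's real one), the idea can be sketched as:

```javascript
// Copy the fight model's weights into the target ("accumulator") model.
// Called once after each fight. The array-of-layers weight layout here
// is a hypothetical stand-in for the real model representation.
function updateTargetModel(model, targetModel) {
  targetModel.weights = model.weights.map((layer) =>
    layer.map((row) => (Array.isArray(row) ? [...row] : row))
  );
}
```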

## Predict action

This function receives a state and returns an action for an agent; in this case it's Subzero, but it's the same for Kano. There is also an epsilon probability that the agent takes a random action, to ensure exploration of the environment.
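The original snippet is not shown; the epsilon-greedy logic can be sketched like this (`model.predict` is a stand-in for the real model call):

```javascript
// Epsilon-greedy action selection: with probability epsilon pick a random
// action (exploration), otherwise pick the action with the highest Q-value
// (exploitation). `model.predict` is an assumed interface.
function predictAction(model, state, epsilon, numActions = 9) {
  if (Math.random() < epsilon) {
    return Math.floor(Math.random() * numActions); // random exploration
  }
  const qValues = model.predict(state);         // one Q-value per action
  return qValues.indexOf(Math.max(...qValues)); // greedy action
}
```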

## Memorize

Memorize adds a sequence of state, action, reward and next state to a collection called memory, which is later used to associate actions with the rewards they produced.
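Sketched in plain JavaScript (the eviction of old entries when the buffer fills up is an assumption, mirroring the usual replay-memory behavior):

```javascript
// Append one transition to replay memory, evicting the oldest entry
// when the buffer is full (maxSize is an assumed default).
function memorize(memory, state, action, reward, nextState, done, maxSize = 2000) {
  memory.push({ state, action, reward, nextState, done });
  if (memory.length > maxSize) memory.shift();
}
```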

## Replay

This is the function that performs the training itself: it takes a batch sampled from memory, with the size defined in the hyperparameters, and trains the model with the rewards received for the sequence of actions performed; then it reduces the exploration probability to make the agent's actions less random.
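The replay step can be sketched as follows; `model.predict` and `model.fit` are stand-ins for the real training API, and the target computation is the standard Q-learning update:

```javascript
// Sample a minibatch from memory, build Q-learning targets, fit the model
// on them, then decay epsilon. `model.predict`/`model.fit` are assumed
// interfaces standing in for the real library calls.
function replay(model, memory, params) {
  const batch = [];
  for (let i = 0; i < Math.min(params.batchSize, memory.length); i++) {
    batch.push(memory[Math.floor(Math.random() * memory.length)]);
  }
  for (const { state, action, reward, nextState, done } of batch) {
    const target = model.predict(state).slice(); // current Q-values
    target[action] = done
      ? reward // terminal state: the reward alone
      : reward + params.gamma * Math.max(...model.predict(nextState));
    model.fit(state, target); // one gradient step toward the target
  }
  if (params.epsilon > params.epsilonMin) {
    params.epsilon *= params.epsilonDecay; // less exploration over time
  }
}
```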

## Main Function

In the main function we initialize the hyperparameters and then run a series of episodes, where each episode represents a fight. For each episode we start a time-step counter that marks each new state to be processed when evaluating the next action.

Then we take batches of 32 sequences to train the model, where a sequence is a data structure with the following form:

`seq = {state, action, reward, nextState, done}`

In this algorithm there are two models: `model` is the one that learns from the batches of a particular fight, and `target model` is the one that is updated with the `model` weights after each fight; in other words, the model that accumulates the learning from the numerous fights the agents have with each other.
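Putting the pieces together, the main loop can be sketched like this; `env`, `agent` and their methods are hypothetical stand-ins for the game integration, not the article's actual code:

```javascript
// Sketch of the main training loop: one episode per fight. `env` and
// `agent` are hypothetical interfaces standing in for the real integration.
function train(env, agent, params, episodes) {
  for (let episode = 0; episode < episodes; episode++) {
    let state = env.reset(); // start a new fight
    let done = false;
    let t = 0;               // time-step counter for this episode
    while (!done) {
      const action = agent.predictAction(state, params.epsilon);
      const step = env.step(action); // { nextState, reward, done }
      agent.memorize(state, action, step.reward, step.nextState, step.done);
      state = step.nextState;
      done = step.done;
      t += 1;
      agent.replay(params.batchSize); // train on a batch of 32 sequences
    }
    agent.updateTargetModel(); // fold this fight into the target model
  }
}
```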

# First training tests

In this first iteration the reward is calculated only as the life lost by the enemy after a state `s`. The problem with this approach is that it takes too long to converge, because the program performs too many random actions without receiving any reward.

So, as a possible solution, the reward can also be granted for reducing the distance to the enemy; this makes both agents act more aggressively, approaching each other quickly and consequently receiving the reward associated with the fight in fewer iterations.
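A shaped reward along those lines could look like this; the state fields match the ones listed earlier, but the 0.1 weighting of the distance bonus is an assumption:

```javascript
// Distance-shaped reward for Subzero: damage dealt to the enemy plus a
// small bonus for closing the gap. The 0.1 weighting is an assumed value.
function reward(prevState, state) {
  const damage = prevState.kanoLife - state.kanoLife; // life the enemy lost
  const dist = (s) =>
    Math.hypot(
      s.subzeroPosition.x - s.kanoPosition.x,
      s.subzeroPosition.y - s.kanoPosition.y
    );
  const approach = dist(prevState) - dist(state); // > 0 when closing in
  return damage + 0.1 * approach;                 // small shaping bonus
}
```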

# Conclusion

In this article we made a first approach to an implementation of Deep Q-learning in JavaScript using the Google Colab environment.

During this week I am going to run different training tests, varying the parameters, to see if the agent reaches an acceptable result 😀.

You can check all the Colab source code here and the game's GitHub here.