Q-Learning: A Baby Step Towards Reinforcement Learning

Original article source: Deep Learning on Medium

(Picture by Jessica Rocowitz on Unsplash)

Introduction To Reinforcement Learning:

Machine learning is primarily known for supervised and unsupervised learning tasks. Here we will look into another interesting learning paradigm, which has a special component called an agent. An agent learns primarily by receiving a signal called the reward signal. An example is a robot that learns where to move by receiving such signals.

Terms Connected With Reinforcement Learning:

Here we shall see some of the important terms related to reinforcement learning.


Agent:

It is a model which is supposed to fulfill a desired task. For example, a self-driving car can be an agent fulfilling the task of automated driving.


Environment:

This is the world where the agent performs all its actions. It is also responsible for providing the agent feedback about how it performed. An example is the road for the self-driving car.


Action:

This is basically the decision the agent takes in an environment. An example is steering the car.

Reward Signal:

This is the response or feedback the agent receives because of its action. It is a scalar value.


State:

A description of the environment as observed by the agent.

Terminal State:

This is a state from which no further action can be taken. Examples are the self-driving car reaching its destination or getting trapped.

Q-Learning:

After getting the basic idea about reinforcement learning, let us now jump to the technique called Q-learning. It is a basic algorithm which seeks to learn a policy that maximizes the total reward.

Creating A Q-Table:

Creating a Q-table lies at the heart of this algorithm. It is a matrix with one row per state and one column per action, and we initialize all of its values to zero. After every step taken, we update and store the Q-values. This table serves as the reference for our further actions.

import numpy as np

q_table = np.zeros((state_size, action_size))
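As a concrete sketch, here is what this looks like for a hypothetical environment with 4 states and 2 actions (the sizes are made up for illustration; the real values come from your environment):

```python
import numpy as np

# Hypothetical environment sizes; real values come from your environment.
state_size, action_size = 4, 2
q_table = np.zeros((state_size, action_size))

# Every [state, action] entry starts at zero.
print(q_table.shape)   # (4, 2)
print(q_table[0, 1])   # 0.0
```

Looking up `q_table[state, action]` then gives the current estimate of how good that action is in that state.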

Updating The Table:

After initializing the table, we are supposed to update it. This happens as the agent interacts with the environment and updates the corresponding [state, action] entry accordingly.

The action can be of two types: explore and exploit. Let us understand them.


Exploitation:

Exploit-based actions make the agent take the decision with the maximum expected reward in a given state, according to the current Q-table.


Exploration:

In this method we allow the agent to select actions randomly instead of making the greedy choice. This way the agent is able to learn about new states, unlike the previous method, where the agent’s action depends entirely on the current Q-values.

Both exploration and exploitation are required in reinforcement learning. Without exploration the agent won’t be able to learn much about the environment, whereas without exploitation the agent won’t be able to take the correct decisions. Hence we choose between the two in a balanced way. We can choose a value epsilon to denote how much we would like to explore versus exploit.
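This balancing act is commonly implemented as epsilon-greedy action selection. A minimal sketch, assuming a numpy Q-table as above (the helper name `choose_action` is hypothetical):

```python
import numpy as np

def choose_action(q_table, state, epsilon):
    # Epsilon-greedy: explore with probability epsilon, exploit otherwise.
    n_actions = q_table.shape[1]
    if np.random.rand() < epsilon:
        # Explore: pick a random action.
        return int(np.random.randint(n_actions))
    # Exploit: pick the action with the highest Q-value for this state.
    return int(np.argmax(q_table[state]))
```

With `epsilon = 1.0` the agent always explores; with `epsilon = 0.0` it always exploits. A common practice is to start with a high epsilon and decay it as the agent learns.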

The Equation:

Now it is our turn to update the Q-table. This can be achieved using the following equation.

Q(state, action) = Q(state, action) + alpha * [R(state, action) + gamma * max(Q(new state, all actions)) - Q(state, action)]

Here state refers to the current state, and new state refers to the state reached by the agent after taking the action. The learning rate alpha signifies how much we change the old value based on the term that follows it. Gamma is the discount factor, which is used to balance immediate and future rewards; it lies in [0, 1]. A value close to 1 means the agent gives nearly as much importance to future rewards as to the immediate reward, while a value close to 0 makes it focus on the immediate reward. Finally, the function R is the reward, denoting how much the agent received after completing a certain action.

The above equation and all the factors discussed above would help us in filling the q-table.
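The update equation above can be sketched in a few lines of numpy. The sizes, alpha, gamma, and the helper name `q_update` are illustrative assumptions, not values from the article:

```python
import numpy as np

# Hypothetical small setup: 5 states, 2 actions; alpha and gamma are example values.
state_size, action_size = 5, 2
q_table = np.zeros((state_size, action_size))
alpha = 0.1   # learning rate
gamma = 0.9   # discount factor

def q_update(q_table, state, action, reward, new_state):
    # Q(s, a) <- Q(s, a) + alpha * [R + gamma * max over a' of Q(s', a') - Q(s, a)]
    best_next = np.max(q_table[new_state])
    q_table[state, action] += alpha * (reward + gamma * best_next - q_table[state, action])

# One step: from state 0, action 1 yields reward 10 and leads to state 1.
q_update(q_table, 0, 1, 10.0, 1)
print(q_table[0, 1])  # 1.0, since all other entries are still zero
```

Repeating this update over many episodes gradually fills the table with estimates of the long-term value of each [state, action] pair.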

In the next article we shall demonstrate an example based on Q-learning.