Reinforcement Learning : Q-Learning

Original article was published on Artificial Intelligence on Medium

Reinforcement Learning : Q-Learning

Introduction to Q-Learning from scratch, we’ll illustrate how this technique works by introducing a robot example where a reinforcement learning agent tries to maximize points. So, let’s get to it!

Q-learning, as the name suggest, it’s a learning based algorithm in reinforcement learning. Q-Learning is a basic form of Reinforcement Learning algorithm that uses Q-values (also called action values) to iteratively improve the behavior of the learning agent.

Main objective of Q-learning is to find a best optimal policy, discussed previously that maximizes the cumulative rewards (sum of all rewards). So, in other words, the goal of Q-learning is to find the optimal policy by learning the optimal Q-values for each state-action pair.

Q-Learning simplified with an example

Let’s consider a ROBOT who starts form the starting position (S) and it’s goal is to move the end point (G). This is basically a game: Frozen Lake where; S=starting point, F=frozen surface, H=hole, and G=goal. Robot can either move forward, downward, left, right.

Robots wins if reaches Goal and looses if falls in a Hole.

Thanks for gif: Link

Now, the obvious question is: How do we train a robot to reach the end goal with the shortest path without stepping on a hole?

Introduction to Q-TABLE

Q-Table is a simple lookup table where we calculate the maximum expected future rewards for future action at each state. Q-table is created as per the possible action the robot can perform. Q-table is initialized with null values.

Initial Q-table

Each Q-table score will be the maximum expected future reward that the robot will get if it takes that action at that state. This is an iterative process, as we need to improve the Q-Table at each iteration.

But the questions are:

  • How do we calculate the values of the Q-table?
  • Are the values available or predefined?

To learn each value of the Q-table, we use the Q-Learning algorithm. As we discussed in the earlier part, that we use Bellman Equation to find optimal values for the action-value pair.

As we start to explore the environment, the Q-function gives us better and better approximations by continuously updating the Q-values in the table.


Now, we’ll set some standard number of episodes that we want the robot to play. Let’s say we want the robot to play five episodes. It is during these episodes that the learning process will take place.

In each episode, the robot starts out by choosing an action from the starting state based on the current Q-values in the table. The robot chooses the action based on which action has the highest Q-value in the Q-table for the current state.

But, wait… That’s kind of weird for the first actions in the first episode, right? Because all the Q-values are set zero at the start, so there’s no way for the robot to differentiate between them to discover which one is considered better. So, what action does it start with?

To answer this question, we’ll introduce the trade-off between exploration and exploitation. This will help us understand not just how an agent takes its first actions, but how exactly it chooses actions in general.

Exploration vs Exploitation

These are the main characteristics of an RL algorithm as discussed in the first blog of this series. Exploration is the act of exploring the environment to find out information about it. Exploitation is the act of exploiting the information that is already known about the environment in order to maximize the return.

Exploring Frozen Lake

Agent (aka Robot), don’t just need to exploit the environment but it even needs to explore it. In order to maximize cumulative rewards or the maximize the future rewards, Out Agent need to know the environment well to perform better. SO, Finally the agent need a balance between exploration and exploitation. To get this balance between exploitation and exploration, we use what is called an epsilon greedy strategy.

Epsilon Greedy Strategy

To get this balance between exploitation and exploration, we use what is called an epsilon greedy strategy. With this strategy, we define an exploration rate ε that we initially set to 1. This exploration rate is the probability that our agent will explore the environment rather than exploit it. With ε=1, it is 100% certain that the agent will start out by exploring the environment.

In the beginning, the epsilon rates will be higher. The robot will explore the environment and randomly choose actions. The logic behind this is that the robot does not know anything about the environment.

To determine whether the agent will choose exploration or exploitation at each time step, we generate a random number between 0 and 1. If this number is greater than epsilon, then the agent will choose its next action via exploitation, i.e. it will choose the action with the highest Q-value for its current state from the Q-table. Otherwise, its next action will be chosen via exploration, i.e. randomly choosing its action and exploring what happens in the environment.

Updating Q-Value

To update Q-value for the action of moving right taken from the previous state, we use the Bellman equation that are highlighted previously. We want to make the Q-value for the given state-action pair as close as we can to the right hand side of the Bellman equation so that the Q-value will eventually converge to the optimal Q-value q*.

This will happen over time by iteratively comparing the loss between the Q-value and the optimal Q-value for the given state-action pair and then updating the Q-value over and over again each time we encounter this same state-action pair to reduce the loss.

To actually see how we update the Q-value, we first need to introduce the idea of a learning rate.

Learning Rate

The learning rate is a number between 0 and 1, which tells the agent to works on previous Q-values or newly learned Q-values. A factor of 0 makes the agent learn nothing (exclusively exploiting prior knowledge), while a factor of 1 makes the agent consider only the most recent information (ignoring prior knowledge to explore possibilities). We denote the learning rate with α symbol. Setting a high value such as 0.9 means that learning can occur quickly.

Other parameters present in previous equation, are already discussed, but let’s quickly revise again:

γ:discount factor, also set between 0 and 1. This models the fact that future rewards are worth less than immediate rewards. Mathematically, the discount factor needs to be set less than 0 for the algorithm to converge.

max(α)the maximum reward that is attainable in the state following the current one, i.e. the reward for taking the optimal action thereafter.

Calculating the new Q-Value

Source: here

By using our learning rate α, we calculate the new q-values.

New Q-Value Equation

For the first action old value will be 0, further on we will consider it. Alright, so now we’ll take this new Q-value we just calculated and store it in our Q-table for this particular state-action pair.

We’ve now done everything needed for a single time step. This same process will happen for each time step until termination in each episode.

Once the Q-function converges to the optimal Q-function, we will have our optimal policy.


  1. Q-Learning is a value-based reinforcement learning algorithm which is used to find the optimal action-selection policy using a Q function.
  2. Our goal is to maximize the value function Q, in order to maximize the cumulative rewards.
  3. The Q-table helps us to find the best action for each state.
  4. Initially all the values in Q-table are zeroes.
  5. Q(state, action) returns the expected future reward of that action at that state. This function can be estimated using Q-Learning, which iteratively updates Q(s, a) using the Bellman equation.
  6. Finally, we update the Q-value as per the above formula.

Thus, in this blog, we almost complete all basics of algorithms of Reinforcement Learning. In the next one, we will implement Reinforcement Learning in Python, by building a game with OpenAI Gym.