Dynamic Programming to Artificial Intelligence: Q-Learning

Original article was published by Mars Xiang on Artificial Intelligence on Medium

A failure is not always a mistake, it may simply be the best one can do under the circumstances. The real mistake is to stop trying. — B. F. Skinner

Reinforcement learning models are beating human players in games around the world. Huge international companies are investing millions into reinforcement learning. Reinforcement learning in today’s world is so powerful because it requires neither data nor labels. It could be the technique that leads to general artificial intelligence.

Supervised and Unsupervised Learning

As a summary, in supervised learning, a model learns to map input to outputs using predefined and labeled data. An unsupervised learning approach teaches a model to cluster and group data using predefined data.

Reinforcement Learning

However, in reinforcement learning, the model receives no data set and guidance, using a trial and error approach.

Reinforcement learning is an area of machine learning defined by how some model (called agent in reinforcement learning) behaves in an environment to maximize a given reward. The most similar real-world example is of a wild animal trying to find food in its ecosystem. In this example, the animal is the agent, the ecosystem is the environment, and the food is the reward.

Reinforcement learning is frequently used in the domain of game playing, where there is no immediate way to label how “good” an action was, since we would need to consider all future outcomes.

Markov Decision Processes

The Markov Decision Process is the most fundamental concept of reinforcement learning. There are a few components in an MDP that interact with each other:

  • Agent — the model
  • Environment — the overall situation
  • State — the situation at a specific time
  • Action — how the agent acts
  • Reward — feedback from the environment

MDP Notation

An agent receives information about its current state from the environment, makes an action, and receives a reward. The process repeats. Source: Sutton, R. S. and Barto, A. G. Introduction to Reinforcement Learning

To repeat what was previously discussed in more mathematically formal terms, some notation must be defined.

  • t represents the current time step
  • S is the set of all possible states, with S_t being the state at time t
  • A is the set of all possible actions, with A_t being the action performed at time t
  • R is the set of all possible rewards, with R_t being the reward received after performing A_(t-1)
  • T is the last time step (the last step happens when a certain condition is reached or t is higher than a value)

The process can be written as:

  1. The agent receives a state S_t
  2. The agent performs an action A_t based on S_t
  3. The agent receives a reward R_(t+1)
  4. The environments transitions into a new state S_(t+1)
  5. The cycle repeats for t+1

Expected Discounted Return (Making Long-Term Decisions)

We discussed that in order for an agent to play a game well, it would need to take future rewards into consideration. This can be described as:

G(t) = R_(t+1) + R_(t+2) +… + R_(T), where G(t) is the sum of the rewards the agent expects after time t.

However, if T is infinite, in order to to make G(t) converge to a single number, we define the discount rate γ to be a number smaller than 1, and define:

G(t) = R_(t+1) + γR_(t+2) +γ²R_(t+2)+…

This can also be written as:

G(t) = R_(t+1) + γG(t+1)

Value and Quality (Q-Learning is Quality-Learning)

A policy describes how an agent will act given any state it finds itself in. An agent is said to follow a policy. Value and Quality functions describe how “good” it is for an agent to be in a state, or a state and perform an action.

Specifically, the value function v_p(s) is equal to the expected discounted return while starting in state s and following a policy p. The quality function q_p(s, a) is equal to the best expected discounted return possible while starting in state s, performing action a, and then following policy p.

v_p(s) = (G(t) | S_t=s)

q_p(s, a) = (G(t) | S_t=s, A_t = a)

A policy is better or equal to another policy if it has a greater or equal discounted expected return for every state. The optimal value and quality functions v* and q* use the best possible policy.

Bellman Equation for Q*

The Bellman Equation another extremely important concept that turns q-learning into dynamic programming combined with a gradient descent-like idea.

It states that when following the best policy, the q value of a state and action (q_p(s, a)) is the same as the reward received for performing a during s plus the maximum expected discounted reward after performing a during s multiplied by the discount rate.

q*(s_t, a_t) = R_(t+1) + γq*(s_(t+1), a_(t+1))

The quality of the best action is equal to the reward plus the quality of best action on the next time step times the discount rate.

Once we find q*, we can find the best policy by using q-learning to find the best policy.


Q-learning is a technique which attempts to maximize the expected reward over all time steps by finding the best q function. In other words, the objective of q-learning is the same as the objective of dynamic programming, but with the discount rate.

In q-learning, a table with all possible state-action pairs is created, and the algorithm iteratively updates all the values of the table using the bellman equation until the optimal q-values are found.

We define a learning rate, a number between 0 and 1 describing how much of the old q-value we overwrite and the new one we keep each iteration.

The process can be described like with the pseudocode:

Q = np.zeros((state_size, action_size))
for i in range(max_t):
action = np.argmax(Q[current_state,:])
new_state, reward = step(action)
Q[state, action] = Q[state, action] * (1-learning_rate) + \ (reward + gamma * np.argmax(Q[new_state,:])) * learning_rate
state = new_state

Exploration and Exploitation

In the beginning, we do not know anything about our environment, so we want to prioritize exploring and gathering information, even it it means we do not get as much reward as possible.

Later, we want to increase our high score and prioritize finding ways to getting more rewards by exploiting the q-table.

To do this, we can create the variable epsilon, described by hyperparameters to describe when to explore, and when to exploit. Specifically, when a random number generated is higher than epsilon, we exploit, otherwise, we explore.

The new code is as follows:

Q = np.zeros((state_size, action_size))
epsilon = 1
for _ in range(batches):
for i in range(max_t):
if(epsilon > random.uniform(0, 1)):
action = np.argmax(Q[state,:])
action = np.random.rand(possible_actions(state))
new_state, reward = time_step(action)
Q[state, action] = Q[state, action] * (1-learning_rate) + \(reward + gamma * np.argmax(Q[new_state,:])) * learning_rate
state = new_state
epsilon *= epsilon_decay_rate



  • Reinforcement learning focuses on a situation where an agent receives no data set, and learns from the actions and rewards it receives from the environment.
  • The Markov Decision Process is a control process that models decision making of an agent placed in an environment.
  • The Bellman Equation describes a characteristic that the best policy has that turns the problem into modified dynamic programming.
  • The agent prioritizes exploring in the beginning, but eventually transitions to exploiting