RL- Policy Gradient Method

Source: Deep Learning on Medium

Go to the profile of Unnat Singh

What are Policy Gradient Methods?

Policy gradient methods are a subclass of policy-based methods.

  • It estimates the weights of an optimal policy through gradient ascent by maximizing expected cumulative reward which an agent gets after taking optimal action in a given state.
  • Reinforcement learning is divided into two types of methods:
  1. Policy-based method (Policy gradient, PPO and etc)
  2. Value-based method (Q-learning, Sarsa and etc)

In the value-based method, we calculate Q value corresponding to every state and action pairs. And the action which is chosen in the corresponding state is the action which has maximum Q-value in the given state. So in the value-based method, we first calculate Q-value and then according to this value we choose the action which maximizes the Q-value and at the same time we choose the action and also update the Q-table.

And equation for finding Q-value changes on the approach which we take, basically there are two approaches:

  1. Monte Carlo (high variance, no bias)
  2. Temporal Difference (low variance, biased)

I will cover this in more detail in another article, in this article I will give you the equations for both Monte Carlo and Temporal Difference methods for calculating Q values.

Monte Carlo Method:

Monte-Carlo method

Temporal Difference Method:

Temporal Difference method

This is what a Q-table looks like:


In each state (represented in rows) the optimal action is the action(represented in columns) which has the highest value in that state. In state one, it is up arrow and state two its right arrow and in state three it’s again the up arrow.

Value-Based Method

  • Interaction → Optimal Value Function q* → optimal policy π*
  • Value Function is represented in the form of a table, where rows correspond to states and column corresponding to an action.
  • And then we use the above table to build an optimal policy.
  • But what about the environment with much larger state space?
  • So we investigated how to represent the optimal action-value function with a neural network which formed the basis of the Deep Q Learning algorithm.
  • Input dimension: state dimension, Output dimension: action dimension.
  • But the important message here is that both cases, whether we use the table for small state spaces or a neural network for much larger state spaces, we had to first estimate optimal action-value before we could tackle the optimal policy.

Can we directly find the optimal policy without worrying about a value function at all?

  • Yes! → Policy-based methods.
OpenAI Gym Documentation: https://gym.openai.com/docs

Policy Function Approximation

  • How might we approach this idea of estimating an optimal policy?
  • Let’s take cart pole example
  • In this case, the agent has two actions and dimension of state space is four.
  • It can push the cart either left or right.
  • At each time step agents picks one of two actions.
  • And we can construct a neural network that approximates the policy, that accepts the state as input and returns the probability of selecting each action in that state.
  • So if there are two possible actions, the output layer will have two nodes.
  • The agent uses this policy to interact with the environment by just passing the most recent state to the network.
  • It outputs the action probability and then the agent samples from those probabilities to select an action in response.
  • Our objective then is to determine appropriate values for the network weights so that each state that we pass into the network it returns the action probabilities where the optimal action is most likely to be selected.
  • This will help agent with its goal to maximize expected return.
  • This is an iterative process where weights are initially set to random values.
  • Then, as the agent interacts with the environment and learns more about the strategies which are best for maximizing the reward.

More on the policy

  • The agent can use a simple neural network architecture to approximate a stochastic policy. The agent passes the current environment state as input to the network, which returns action probabilities. Then, the agent samples from those probabilities to select an action.
  • The same neural network architecture can be used to approximate a deterministic policy. Instead of sampling from the action probabilities, the agent need only choose the greedy action.
  • Softmax as the output activation function.

Why Policy-Based Methods?

  • Simplicity
  • Stochastic Policies
  • Works in continuous Action Space
  • Deterministic: π: s → a
  • Stochastic: π(s,a) = P[a|s]
  • Formulating policy in this manner allows us to makes such generalization where possible and focus more on the complicated regions of state space.
  • One of the main advantages of policy-based methods over value-based methods is that they can learn true stochastic policies.

Alias State

  • Alias states are identical state, according to state features (which can be limited, and in reality are not identical) and have different optimal actions, but the agent cannot differentiate between the states.
  • The best action that can be taken in these states is equiprobable random action — which value-based methods are not able to do.
  • Because in the value-based method, the method or technique by which we take random action is epsilon-greedy method and after training a value based method, our epsilon is small, because of this we can’t take anymore random action with a smart agent!
  • So the only way to take equiprobable random actions is through policy-based methods.

Continuous action space

  • Policy-based methods are well suited for continuous action spaces.
  • When we use a value-based method even with the function approximator (neural network for continuous state and action), our output consists of a value for each action.
  • Now if acting space is continuous, then the max operation turns out to be an optimization problem itself.
q-learning update step (value based method) this is TD update
  • It’s like trying to find global maximum of a continuous function which is non-trivial.

The Big Picture

Before digging into the details of policy gradient methods, we’ll discuss how policy-gradient work at a high level.

Collect an episode.
Change the weights of the policy network:
if WON, increase the probability of each (state,action) combination.
if LOST, decrease the probability of each (state,action) combination.


  1. For each episode, if the agent won the game, we’ll amend the policy network weights to make each (state, action) pair that appeared in the episode to make them more likely to repeat in a future episode.
  2. For each episode, if the agent lost the game, we’ll change the policy network weights to make it less likely to repeat the corresponding (state, action) pairs in the future.

Connections to supervised learning

Policy gradient methods are very similar to supervised learning.

The important difference between Policy Gradient and Supervised Learning

  • One important difference is that when we do image classification (supervised learning). Typically we work with a dataset that doesn’t change over time(means for an image when we are training the label will be same every time whenever we are passing it through neural network unlike Policy Gradient in Reinforcement Learning).
  • However, in the reinforcement learning setting. The dataset varies by episode.
  • So we use policy to collect an episode, that gives us a dataset or a bunch of matched state action pairs and then we use that dataset once to do our batch of updates to neural network weights.
  • After these updates are done we discard the datasets and then collect a new set of data(another episode).
  • So dataset is changing continuously, it’s highly likely that we will experience a situation where the dataset has multiple conflicting opinions about what the best output should be for input, or in other words, what the best action is to take from a game state.
  • It like the same image appears in supervised learning with different labels.
  • So this does make our current situation more complex.

Problem Setup

We’re now ready to get started with rigorously defining how policy gradient methods work.


  • The first thing we need to define is the Trajectory.
  • Trajectory = state-action sequence
  • 𝑠0,𝑎0,𝑟1,𝑠1,𝑎1,𝑟2,𝑠2,𝑎2,𝑟3,𝑠3,…
  • But actually, a trajectory is a little bit more flexible because there is no restriction on its length.
  • So, it can correspond to a full episode or just a small part of an episode.
  • We denote trajectory length with H, where H stands for Horizon and denote trajectory with τ.
  • Then, the sum reward from the Trajectory is written as R(τ).
  • Our goal is to find weights θ of the neural network that maximize expected return.
  • One way of accomplishing this is by setting the weights of the neural network so that on average, the agent experience trajectories that yield a high return.
  • We denote the expected return by U and it’s a function of θ.
  • We want to find the value of θ that maximizes U.

Expected Return

expected return

Important note

U(θ) = Σ P(τ;θ)R(τ)

To see how it corresponds to the expected return, note that we’ve expressed the return R(τ) as a function of the trajectory τ. Then, we calculate the weighted average (where the weights are given by P(τ;θ)) of all possible values that the return R(τ) can take.

Why Trajectories?

We may be wondering: why are we using trajectories instead of episodes? The answer is that maximizing expected return over trajectories (instead of episodes) lets us search for optimal policies for both episodic and continuing task!

That said, for many episodic tasks, it often makes sense to just use the full episode.


We’ve learned that our goal is to find values of the weights θ in the neural network that maximizes the expected return U.

U(θ) = Σ P(τ;θ)R(τ)

where τ is an arbitrary trajectory. One way to determine the value of θ that maximizes this function is through gradient ascent. This algorithm is closely related to gradient descent, where the difference is that:

  • gradient descent is designed to find the minimum of a function, whereas gradient ascent will find the maximum, and
  • gradient descent steps in the direction of the negative gradient, whereas gradient ascent steps in the direction of the gradient.

Our update step for Gradient Ascent

gradient ascent
  • As we’ve learned, we can express the expected return as a probability-weighted sum, where we take into account the probability of each possible trajectory and, the return permits trajectory.
  • GOAL: find θ maximizes the expected return.
  • One way to do that is by Gradient Ascent, where we iteratively take small steps in the direction of the gradient.
  • Gradient Descent → minimizes the function:
  • θ ← θ – αδU(θ)
  • Gradient Ascent → maximizes the function:
  • θ ← θ +αδU(θ)
  • To apply above-mentioned methods we need to calculate the gradient.
  • Now, we won’t be able to calculate the exact value of the GRADIENT since that is computationally too expensive.
  • It’s computationally expensive because, in order to calculate the gradient exactly, we’ll have to consider every possible trajectory.
  • To calculate the gradient δU(θ), we have to consider every possible trajectory.
  • To estimate the gradient δU(θ), we have to consider a few trajectories.
  • Specifically, we’ll use the policy to collect trajectories.
  • Use the policy π(θ) to collect trajectories τ(1),τ(2), ….., τ(m).
  • Remember, any trajectory is just the sequence of states and action and we’ll use this notation refers to the ith trajectory.
  • Then we will use this trajectory to estimate the gradient δU(θ).


  • One important thing to note here is an estimate for the gradient which we refer to as g_hat is equal to the consolidation of information from the M trajectories.
  • And I’m not deriving this equation in this article.

Recap the big picture

Diving deep into Gradient Equation


Initial Idea

  • Change the Policy weight to accommodate the following condition:
  • If WON, increase the probability of each (state, action) combination (sequence) in the trajectory which ended in winning.
  • If LOST, decrease the probability of each (state, action) combination in the trajectory which combination (sequence) ended in loss.

Initial Assumption (simplifying the problem)

  1. Only one trajectory for calculating the gradient (for weights update)
  2. In terms of our gradient equation m=1.
  3. And trajectory τ corresponds to the full episode.

Now the equation for the simplified problem:

  • We currently assuming that we have calculated the full episode which we referred as to τ.
  • R(τ) is just a cumulative reward from that trajectory.
  • Now let’s take an example environment, in which the agent has to cross a road now the environment gives a reward of +1 if the agent crosses the road and -1 otherwise.
  • And τ is just a sequence of states and actions.
  • And πθ(at|st) term looks at the probability of that the agent selects action at in states st.
  • And πθ refers to a policy which agent follows and policy depends on the weight θ.
  • Then this full calculation here takes the gradient of the log of that probability.
  • This will tell us how we should change the weights of the policy theta if we want to increase the log probability of selecting action at and state st.
  • Specifically, if we nudge the weight in the direction of gradient then we increase the log probability of selecting that action in that state space and if we step in the opposite direction will decrease the log probability of selecting that action in that space.

Basic Working

  • The basic idea here is that if WON, then cumulative reward +1 so gradient is positive it will nudge weight in the direction of gradient and hence will increase the probability of taking that action in that state and if LOST, cumulative reward -1 so gradient is negative it will nudge the weight in direction opposite to the gradient and hence minimize it, will decrease the probability of taking that action in that state.

Implementation of Policy Gradient with PyTorch

What’s Next?

In this article, we’ve learned about the REINFORCE algorithm, which was illustrated with a toy environment with a discrete action space. But it’s also important to mention that REINFORCE can also be used to solve the environment with continuous action spaces.

For an environment with a continuous action space, the corresponding policy network could have an output layer that parameterized a continuous probability distribution.

For instance, assume the output layer returns the mean μ and variance σ2 of a normal distribution.

Then in order to select an action, the agent needs only to pass the most recent state st as input to the network, and then use the output mean μ and variance σ2 to sample from the distribution at — N(μ,σ2)

This should work in theory, but it’s unlikely to perform well in practice. To improve performance with continuous action space, we’ll have to make some small modification to the REINFORCE algorithm, and I’ll cover this in the upcoming article.


What are Policy Gradient Methods?

  • Policy-based methods are a class of algorithm that search directly for the optimal policy, without simultaneously maintaining value function estimation.
  • Policy gradient methods are a subclass of policy-based methods that estimate the weight of an optimal policy through gradient ascent.
  • In this article, we represent the policy with a neural network, where our goal is to find weights θ of the network that maximize expected return.

Τhe Big Picture

  • The policy gradient method will iteratively amend the policy network weights to:
  • make (state, action) pairs that resulted in positive return more likely, and
  • make (state, action) pairs that resulted in negative return less likely.