RL Series-REINFORCE in PyTorch

Source: Deep Learning on Medium

This post is a part of my RL Series.

In this post, we review the REINFORCE algorithm, a Monte-Carlo policy gradient (PG) method. In PG methods, we try to learn a policy that maps states to actions directly.


In value-based methods, we learn a value function and use it to derive the policy. Policy gradient methods, in contrast, can represent stochastic policies and handle continuous action spaces. If you want to use DQN with continuous actions, you have to discretize the action space. This can reduce performance, and if the number of actions is high, it becomes difficult or even impossible. REINFORCE, however, can be used with either discrete or continuous action spaces. It is on-policy because it learns from samples gathered with the current policy.

There are different versions of REINFORCE. The first one is without a baseline. It is as follows:

From the Sutton and Barto book, Reinforcement Learning: An Introduction
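
For reference, the per-step update in the book's REINFORCE pseudocode can be transcribed as below. This is my transcription, so treat the notation as an assumption: α is the step size, γ the discount factor, T the final time step of the episode, and π(a | s, θ) the policy with parameters θ. The update is applied for each step t = 0, ..., T-1 of the generated episode.

```latex
\begin{align*}
G &\leftarrow \sum_{k=t+1}^{T} \gamma^{\,k-t-1} R_k \\
\theta &\leftarrow \theta + \alpha \, \gamma^{t} \, G \, \nabla_{\theta} \ln \pi(A_t \mid S_t, \theta)
\end{align*}
```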

In this version, we take a policy (here a neural network) and initialize it with random weights. Then we play one episode, and afterwards we calculate the discounted return from each time step to the end of the episode. This discounted return (G in the pseudocode above) multiplies the gradient of the log-probability of the action taken. The values of G depend on the environment and on the reward function we define.

For example, suppose we have three actions. The first is a bad action, and the other two are good actions that lead to higher discounted returns. If G is positive for all three actions, we push the network towards all of them: slightly towards action one and more strongly towards the others. Now suppose G is negative for the first action and positive for the other two. Here we push the network away from the first action and towards the other two. So the value of G and its sign are important: they set the direction of the gradient and its step size. When every return has the same sign, even the bad action is reinforced, and the updates have high variance. One way to address this is to use a baseline, which reduces the variance and accelerates learning. For example, subtract the value of the state from G, or normalize G with the mean and variance of the discounted returns of the current episode (subtract the mean and divide by the standard deviation). You can see the pseudocode for REINFORCE with baseline in the following picture:

From the Sutton and Barto book, Reinforcement Learning: An Introduction
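
For reference, the per-step updates in the book's REINFORCE with baseline pseudocode can be transcribed as below. Again this is my transcription and the notation is an assumption: v̂(s, w) is the learned state-value function with weights w, G is the discounted return from step t, and α^w and α^θ are separate step sizes for the value and policy parameters.

```latex
\begin{align*}
\delta &\leftarrow G - \hat{v}(S_t, \mathbf{w}) \\
\mathbf{w} &\leftarrow \mathbf{w} + \alpha^{\mathbf{w}} \, \delta \, \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w}) \\
\theta &\leftarrow \theta + \alpha^{\theta} \, \gamma^{t} \, \delta \, \nabla_{\theta} \ln \pi(A_t \mid S_t, \theta)
\end{align*}
```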

In this version, we first initialize the policy and value networks. It is possible to use two separate networks or one multi-head network with a shared part. Then we play an episode and calculate the discounted return from every step to the end of the episode. We then subtract the predicted value of each state from its discounted return (this difference is an estimate of the advantage) and use it to update the weights of both the value and policy networks. Then we generate another episode and repeat the loop.
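
The update described above can be sketched as a pair of losses in PyTorch. This is a minimal sketch under my own assumptions, not the implementation from this series: the function name `reinforce_baseline_losses` is hypothetical, the inputs are assumed to be 1-D tensors collected over one episode, the value loss is a plain mean squared error, and the γ^t factor from the book's pseudocode is omitted, as is common in practice.

```python
import torch
import torch.nn.functional as F


def reinforce_baseline_losses(log_probs, values, returns):
    """Losses for one episode; each argument holds one entry per time step."""
    # Advantage estimate G - V(s). detach() stops the policy loss from
    # sending gradients into the value network.
    advantages = returns - values.detach()
    # Minimizing this loss performs gradient ascent on sum(log pi * advantage).
    policy_loss = -(log_probs * advantages).sum()
    # The value network regresses towards the observed Monte-Carlo returns.
    value_loss = F.mse_loss(values, returns)
    return policy_loss, value_loss


# Toy numbers standing in for one three-step episode.
log_probs = torch.tensor([-0.5, -1.0, -0.2], requires_grad=True)
values = torch.tensor([1.0, 1.0, 1.0], requires_grad=True)
returns = torch.tensor([2.0, 1.0, 0.5])

policy_loss, value_loss = reinforce_baseline_losses(log_probs, values, returns)
(policy_loss + value_loss).backward()
```

With these toy numbers the advantages are 1.0, 0.0 and -0.5, so the first action is made more likely, the second is left alone, and the third is made less likely.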

Sutton and Barto do not consider the above algorithm an actor-critic method (another class of RL algorithms that we will see in later posts). It learns a value function, but that value function is not used as a critic. I think the reason is that we update the network only at the end of each episode and do not use the critic (our value function) to tell us how good our policy or action is at every step, so it is not actor-critic even though we are learning a value function. All actions in the episode are judged as good or bad only once the episode has ended. In the next posts, we will add this missing piece (telling the agent how good its policy is at every time step, which is the critic's job) and obtain actor-critic methods. As long as the update is episodic, it is Monte-Carlo PG, or REINFORCE, regardless of whether we subtract a constant baseline or a learned value function from G (Advantage = G - V). I think Monte-Carlo policy gradient and actor-critic policy gradient are better names, as I saw in the slides of David Silver's course.


One easy way is to treat actor-critic methods as PG methods rather than as a separate category, as I saw here. The goal is to find the policy directly, whether or not a value function helps us. If the update is episodic, we call it REINFORCE, and if it happens at every time step, we call it actor-critic.

Anyway, let’s continue.

This algorithm can be used for either discrete or continuous action spaces. In discrete action spaces, the policy network outputs a probability distribution over actions, which means the activation function of the output layer is a softmax. To balance exploration and exploitation, the agent samples an action according to these probabilities, so actions with higher probabilities are more likely to be selected.
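
A softmax policy of this kind can be sketched with `torch.distributions.Categorical`. This is only an illustration under my own assumptions: the class name `DiscretePolicy`, the layer sizes, and the 4-dimensional state with 2 actions (CartPole-like) are hypothetical. Passing `logits=` lets `Categorical` apply the softmax internally.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical


class DiscretePolicy(nn.Module):
    """Maps a state to a categorical distribution over actions."""

    def __init__(self, state_dim=4, n_actions=2, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # raw scores (logits), one per action
        )

    def forward(self, state):
        # Categorical normalizes the logits with a softmax.
        return Categorical(logits=self.net(state))


torch.manual_seed(0)
policy = DiscretePolicy()
state = torch.zeros(4)            # placeholder state, not a real observation
dist = policy(state)
action = dist.sample()            # stochastic choice gives exploration
log_prob = dist.log_prob(action)  # kept for the policy gradient update
```

The stored `log_prob` is what gets multiplied by G (or by the advantage) when the episode ends.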

In continuous action spaces, the output layer has no softmax, because the output is the mean of a normal distribution. We use one output neuron per action, and it can take any value. In other words, the policy is a normal distribution whose mean is computed by a neural network. The variance can be fixed, decreased over time, or learned. If it is learned, you can make it a function of the input state or define it as a parameter that is trained by gradient descent. If you want to learn sigma too, you have to account for the number of actions. For example, if we want to map the front-view image of a self-driving car to steering and throttle-brake, we have two continuous actions, so we need two means and two variances. During training, we sample from this normal distribution to explore the environment, but at test time we use only the mean as the action.
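
A Gaussian policy along these lines can be sketched with `torch.distributions.Normal`. This is a minimal sketch under my own assumptions, not the code from this series: the class name `GaussianPolicy` and the layer sizes are hypothetical, and of the options listed above it picks a state-independent sigma stored as a learnable log standard deviation, with one entry per action.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal


class GaussianPolicy(nn.Module):
    """Maps a state to an independent normal distribution per action."""

    def __init__(self, state_dim=4, action_dim=2, hidden=32):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, action_dim),  # unbounded means, no softmax
        )
        # One learnable log-sigma per action; exp() keeps sigma positive.
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        return Normal(self.mean_net(state), self.log_std.exp())


torch.manual_seed(0)
policy = GaussianPolicy()
state = torch.zeros(4)                        # placeholder state
dist = policy(state)
train_action = dist.sample()                  # training: sample to explore
log_prob = dist.log_prob(train_action).sum()  # joint log-prob of both actions
test_action = dist.mean                       # testing: act with the mean
```

Summing the per-action log-probabilities treats the two actions as independent, which is an assumption of this sketch.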

Here is my implementation of REINFORCE for a discrete action space in an episodic environment. In this implementation, I normalized G with the mean and variance of the current episode and multiplied it by the gradient. I tested this algorithm on the CartPole-v0 environment.
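
The return computation and normalization mentioned above can be sketched in isolation. This is a pure-Python illustration and not the original implementation: the function names `discounted_returns` and `normalize` are hypothetical, and `gamma=0.5` is chosen only to keep the numbers easy to check by hand.

```python
import math


def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step of one episode."""
    returns = []
    running = 0.0
    # Walk backwards so each return reuses the one that follows it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def normalize(returns, eps=1e-8):
    """Subtract the episode mean and divide by the episode standard deviation."""
    mean = sum(returns) / len(returns)
    variance = sum((g - mean) ** 2 for g in returns) / len(returns)
    std = math.sqrt(variance)
    # eps avoids division by zero when all returns are equal.
    return [(g - mean) / (std + eps) for g in returns]


rewards = [1.0, 1.0, 1.0]
returns = discounted_returns(rewards, gamma=0.5)  # [1.75, 1.5, 1.0]
normalized = normalize(returns)
```

All three raw returns are positive, but after normalization the last one becomes negative, so the update pushes away from the action with the lowest return instead of towards all three.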

And here is another version of REINFORCE, for a continuous action space in an episodic environment. As in the code above, I normalized G with the mean and variance of the current episode and multiplied it by the gradient. I first tested the algorithm on the “MountainCarContinuous-v0” environment. In this environment, the reward function creates an exploration problem: the agent can barely find the goal, so it rarely gets a good reward. After some tries, it prefers to do nothing. REINFORCE couldn’t solve this task, and I think we need more advanced approaches for it. We will come back to this environment in later posts with more advanced tools. So I decided to test this algorithm on a continuous version of the cart pole environment that I found here, and it worked and learned the task completely.

Here I learned the value function and used it as a baseline in REINFORCE with a discrete action space. It can learn the CartPole-v0 task completely.

Here again, we learn the value function and use it as a baseline, this time with REINFORCE in a continuous action space. It can learn the continuous cart pole task perfectly.

There are some resources to learn more about PGs and REINFORCE: