Advantage Actor Critic continuous case implementation

Source: Deep Learning on Medium

Whoa! This one has been quite tough! Also, having a beautiful one-year-old kid doesn’t make writing articles and having side projects easy. Anyway, let’s get to the point…

After playing a bit with A2C in the CartPole environment I wanted to try a continuous case. Continuous refers to the fact that the actions can take any value: 0.1, 2.678, -1.123456789… usually within minimum and maximum boundaries.

For example, if we want to teach an RL agent to handle a drone, it must be able to return optimal values for the engines; it doesn’t select which engine to turn on/off, but the speed of all of them. If it has 4 engines, the agent produces 4 simultaneous actions/numbers, one number per engine.

Another example: how many degrees the steering wheel of a car should turn.

Those are continuous problems (although we can convert them into categorical ones by creating buckets).

So in the end the agent produces probability distributions and takes samples from them (it’s basically the same idea I explained here: How to add uncertainty to your models).

Inside OpenAI Gym we have the Pendulum environment to play with this kind of problem. And guess what? It took me a looooot of effort to make this work. I couldn’t actually make it work myself, so I looked for a project with A2C implemented, found the stable-baselines repo and went through the code. It took me so long because, at least for my implementation, I had to add something I didn’t know about, and that is gradient clipping. Well, I knew about gradient clipping, but not that it would be so important here. Forget about it for now to keep things simple, I’ll cover it later.

Architecture

Actor

A couple of things to mention. As we can see, it returns a probability distribution, which we will later sample from. The network calculates the mean (the location) of the distribution, but not the standard deviation (the scale). Why? Because I have seen that this way is more stable, although having the network output the standard deviation is also a valid option. See this quote from OpenAI:

In VPG, TRPO, and PPO, we represent the log std devs with state-independent parameter vectors. In SAC, we represent the log std devs as outputs from the neural network, meaning that they depend on state in a complex way. SAC with state-independent log std devs, in our experience, did not work.

I don’t want to make any claim about which option is better, just that both are valid; depending on the case, one might be preferred over the other.

Also note the standard deviation parameter isn’t the std directly, but its logarithm. Eh? What? Where are you calculating the logarithm? Nowhere. The parameter is simply interpreted as the log of the standard deviation, and because we handle it that way, it becomes the log std. We just have to exponentiate it to get the standard deviation for the distribution. By doing so we ensure we never get negative values (and some might argue it helps with numerical stability too). The standard deviation can’t be negative (nor 0).
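A minimal sketch of such an actor in PyTorch (the layer sizes, names and hidden activation are my own choices, not the article’s exact code):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: returns a Normal distribution over actions."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),  # mean (location) of the distribution
        )
        # State-independent parameter, interpreted as log(std).
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        mean = self.body(state)
        std = self.log_std.exp()  # exponentiate: guarantees std > 0
        return torch.distributions.Normal(mean, std)
```

Note the `log_std` parameter doesn’t depend on the state, matching the VPG/TRPO/PPO style mentioned in the quote above.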

Critic

Nothing fancy here, we return one scalar value: the expected return for the given state.
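A sketch of the critic under the same assumptions as the actor above (names and sizes are mine):

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Value network: returns one scalar, the expected return for a state."""
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),  # single scalar value V(s)
        )

    def forward(self, state):
        return self.body(state)
```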

Actor loss

A simplified version of the actor loss is the negative log probability of the action weighted by the advantage: loss = -log π(a|s) · A(s, a).

We obtain the log probability of the action and scale it by the advantage. If the advantage is positive, that’s good (the loss decreases); if the advantage is negative, that’s bad (the loss increases).

Did you notice it? It’s almost the same as the negative log likelihood!
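In PyTorch this simplified loss might look like the following (a sketch; the function name and signature are mine):

```python
import torch

def actor_loss(dist, actions, advantages):
    """Negative log probability of the actions, weighted by the advantage."""
    # log-probability of the taken actions under the current policy,
    # summed over action dimensions
    log_probs = dist.log_prob(actions).sum(dim=-1)
    # Positive advantage -> lower loss, negative advantage -> higher loss.
    # detach(): the advantage is treated as a constant for the actor.
    return -(log_probs * advantages.detach()).mean()
```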

Critic loss

Again, this one is simpler: just the mean squared error between the TD target and the value from the critic (the value function).
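As code, that is a one-liner (sketch; names are mine):

```python
import torch
import torch.nn.functional as F

def critic_loss(values, td_targets):
    # Mean squared error between the critic's value and the TD target.
    # detach(): the TD target is treated as a fixed regression target.
    return F.mse_loss(values, td_targets.detach())
```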

The full code can be seen here, but I wanted to keep this article simple to make it easier to understand.

Gradient clipping

So, as I said, I found that stable-baselines uses gradient clipping here https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/a2c/a2c.py#L148, with a default value of 0.5. What this means is reflected in the following snippet:
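A hedged sketch of what such clipping computes (a rough equivalent of PyTorch’s `clip_grad_norm_`; the function name and epsilon are mine):

```python
import torch

def clip_grad_norm(parameters, max_norm=0.5):
    """Scale all gradients so their global L2 norm is at most max_norm."""
    grads = [p.grad for p in parameters if p.grad is not None]
    # Global L2 norm over all gradients together
    total_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    # Coefficient that brings the norm down to max_norm
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1:  # only scale down, never up
        for g in grads:
            g.mul_(clip_coef)
    return total_norm
```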

In the example above, clip_coef becomes the value we have to multiply the gradients by to ensure their norm is clipped at 0.5. You can take a look at PyTorch’s implementation here, it’s easy to read: https://pytorch.org/docs/stable/_modules/torch/nn/utils/clip_grad.html

Some graphs to visualize the change:

We can clearly see how the magnitude of the gradients changes from one case to the other. With no clipping the gradients have a lot of variance, with values going from 0 to 60, or even 120. In the clipped version the gradient norm is always below 0.5, as we expect.

How does this affect the weights over time?

Looks like not much! I mean, I would have expected to see some big difference between the distributions of the weights in one case and the other.

On the other hand, there is clearly a difference in the performance of the two versions:

What improves it

I tried lots of things, but only found one that really makes a difference.

Mish activation

If you don’t know about this activation, go and check it out right now. It’s super simple to implement and the results are striking: just replacing ReLUs with it seems to improve almost any network.

Paper: Mish: A Self Regularized Non-Monotonic Neural Activation Function

And a very interesting discussion about Mish on the fastai forum (this forum is gold).

What does this look like in code?
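A minimal PyTorch module for it, following the paper’s definition mish(x) = x · tanh(softplus(x)) (the class name is my own):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    """Mish activation: x * tanh(softplus(x))."""
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))
```

It can be dropped into any `nn.Sequential` in place of `nn.Tanh()` or `nn.ReLU()`.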

And here you can see the difference in the rewards obtained by changing from tanh to Mish:

What about the entropy?

You might have seen in some places that it is recommended to add the entropy to the loss so the network doesn’t stop exploring too early. How does it work? Why can the entropy help? For a Gaussian, the entropy is H = ½ · log(2πeσ²).

Only the variance (the square of the standard deviation) appears as a variable in it: when the variance goes down the entropy goes down, and when the variance goes up the entropy goes up.

On the stable-baselines repo https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/a2c/a2c.py#L156 we can see that they use the entropy negated. The higher the variance, the lower that term of the loss. So the network has to increase the variance to reduce the loss. Weird, no? As mentioned, the reason for this is to ensure the network doesn’t stop exploring too early: making the entropy part of the loss forces it not to collapse the variance.
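The entropy of a Gaussian can be checked directly against `torch.distributions` (a small sketch; the helper name and coefficients in the comment are illustrative, not stable-baselines’ exact code):

```python
import math
import torch
from torch.distributions import Normal

# Entropy of a Gaussian: H = 0.5 * log(2 * pi * e * sigma**2).
# Only the std appears; the mean plays no role.
def gaussian_entropy(std: float) -> float:
    return 0.5 * math.log(2 * math.pi * math.e * std ** 2)

# torch.distributions computes the same value:
h_torch = Normal(torch.tensor(0.0), torch.tensor(1.0)).entropy()

# In the A2C loss the entropy enters negated (ent_coef > 0), e.g.:
# loss = actor_loss + vf_coef * critic_loss - ent_coef * entropy
```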

In practice I didn’t find any real difference with or without the entropy in the loss for this particular problem.

Code

github: https://github.com/hermesdt/reinforcement-learning/blob/master/a2c/pendulum_a2c_online.ipynb

google colab: https://colab.research.google.com/drive/11NsMuJ97xJ1s2IJ5yuBKvvePEQNzOBmZ