Source: Deep Learning on Medium

# Advantage Actor Critic continuous case implementation

Whoa! This one has been quite tough! Also, having a beautiful one-year-old kid doesn’t make writing articles and keeping side projects going easy. Anyway, let’s get to the point…

After playing a bit with A2C on the cartpole environment, I wanted to try a continuous case. Continuous means the actions can take any value (0.1, 2.678, -1.123456789…), usually within minimum and maximum bounds.

For example, if we want to teach an RL agent to fly a drone, it must be able to return optimal values for the engines. It doesn’t select which engine to turn on or off, but the speed of all of them. If the drone has 4 engines, the agent produces 4 simultaneous actions/numbers, 1 number per engine.

Another example: how many degrees to turn the steering wheel of a car.

Those are continuous problems (although we can convert them into categorical ones by creating buckets).
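To make the bucket idea concrete, here is a minimal sketch. The steering range of -30 to +30 degrees, the 5 buckets and the helper names are all assumptions made up for illustration, not something from a real environment.

```python
def make_buckets(low, high, n_buckets):
    """Return the centre value of each of n_buckets equal-width bins in [low, high]."""
    width = (high - low) / n_buckets
    return [low + width * (i + 0.5) for i in range(n_buckets)]


# Hypothetical steering range in degrees, split into 5 discrete actions.
steering_actions = make_buckets(-30.0, 30.0, 5)


def action_from_index(index):
    """Map a categorical action index back to a steering angle in degrees."""
    return steering_actions[index]
```

With this, a categorical agent picks an index from 0 to 4 and the environment receives the centre of that bucket. The price is resolution: the agent can never steer by an angle that falls between two bucket centres.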

So in the end the agent outputs probability distributions and samples its actions from them (it’s basically the same idea I explained here: How to add uncertainty to your models).
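As a rough sketch of "take samples from a distribution", the snippet below draws one value per engine from a normal distribution and clips it to the allowed range. The means, standard deviations and the [0, 1] speed bounds are invented for the example; in the real agent they would come from the actor and the environment.

```python
import random


def sample_action(means, stds, low, high, rng):
    """Draw one value per action dimension from a normal distribution, then clip it."""
    action = []
    for mean, std in zip(means, stds):
        value = rng.gauss(mean, std)
        # Keep the sampled value inside the environment's bounds.
        action.append(min(max(value, low), high))
    return action


rng = random.Random(0)  # seeded so the run is repeatable

# Hypothetical drone with 4 engines, each speed bounded to [0, 1].
engine_means = [0.5, 0.5, 0.6, 0.4]
engine_stds = [0.1, 0.1, 0.1, 0.1]
engine_speeds = sample_action(engine_means, engine_stds, 0.0, 1.0, rng)
```

Clipping is just one simple way to respect the bounds. Other options exist (for example squashing the sample with `tanh`), and I am not claiming this is the one any particular library uses.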

Inside OpenAI gym we have the cartpole environment to play with this kind of problem. And guess what? It took me a looooot of effort to make this work. I couldn’t actually make it work myself, so I looked for a project with A2C implemented, found the stable-baselines repo and went through the code. It took me so long because, at least for my implementation, I had to add something I hadn’t expected to matter: gradient clipping. Well, I knew about gradient clipping, but not that it would be so important here. Forget about it for now to keep things simple; I’ll cover it later.

# Architecture

## Actor

A couple of things to mention. The actor returns a probability distribution, which we will sample from later. The network calculates the mean (the location) of the distribution, but not the standard deviation (the scale). Why? Because I have found this to be more stable, although having the network output the standard deviation is a valid option too. See this quote from OpenAI:

> In VPG, TRPO, and PPO, we represent the log std devs with state-independent parameter vectors. In SAC, we represent the log std devs as outputs from the neural network, meaning that they depend on state in a complex way. SAC with state-independent log std devs, in our experience, did not work.

I don’t want to argue about which option is better, just that both are valid. Depending on the case, one might be preferred over the other.
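To illustrate the "mean from the network, std from a separate parameter" layout, here is a deliberately tiny sketch in plain Python. It is not the actor from my implementation: the class name `GaussianActor`, the single linear layer and the initial values are assumptions chosen only to keep the example short and free of dependencies.

```python
import math
import random


class GaussianActor:
    """Linear policy: the mean depends on the state, the log std does not."""

    def __init__(self, state_dim, action_dim, rng):
        # Stand-in for "the network": one linear layer that produces the mean.
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(state_dim)]
                        for _ in range(action_dim)]
        self.bias = [0.0] * action_dim
        # State-independent parameter vector, one entry per action dimension.
        # In training it would be updated by gradients like any other weight.
        self.log_std = [0.0] * action_dim

    def mean(self, state):
        return [sum(w * s for w, s in zip(row, state)) + b
                for row, b in zip(self.weights, self.bias)]

    def distribution(self, state):
        """Return (means, stds) of the normal distribution for this state."""
        stds = [math.exp(v) for v in self.log_std]
        return self.mean(state), stds


actor = GaussianActor(state_dim=3, action_dim=1, rng=random.Random(0))
mean_a, std_a = actor.distribution([1.0, 0.0, 0.0])
mean_b, std_b = actor.distribution([0.0, 1.0, 0.0])
```

The two states give different means but exactly the same standard deviation, which is what "state-independent" means in the quote above.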

Also note that the standard deviation parameter isn’t the std directly, but its logarithm. Eh? What? Where are you calculating the logarithm? Nowhere. We simply treat the parameter as the log of the standard deviation, and that is what makes it the log std. We just have to exponentiate it to get the standard deviation for the distribution. Because the exponential of any real number is strictly positive, the parameter can take any value during training and the resulting standard deviation is never negative or 0, which a standard deviation can’t be (and some might argue it helps with numerical stability too).
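A small sketch of why the log parametrization is convenient. The function name `normal_log_prob` is mine; the formula is the standard log-density of a normal distribution, written so that `log_std` appears directly.

```python
import math


def normal_log_prob(action, mean, log_std):
    """Log-density of a normal distribution parametrized by mean and log std."""
    std = math.exp(log_std)  # strictly positive for any real log_std
    z = (action - mean) / std
    return -0.5 * z * z - log_std - 0.5 * math.log(2.0 * math.pi)


# Whatever value the optimizer pushes the parameter to, the std stays valid.
stds = [math.exp(log_std) for log_std in (-5.0, 0.0, 2.0)]
```

Note that the `log(std)` term of the density is just the parameter itself, so there is no logarithm of a possibly tiny number to compute.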