Source: Deep Learning on Medium
I recently read the Soft Actor-Critic (SAC) paper, which proposes an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. The authors did a solid job of explaining the nitty-gritty of the idea. While the paper is well written and easy to follow overall, I found some of the equations difficult to follow, especially for someone not familiar with reinforcement learning. One equation that needed more work to derive is the gradient estimator for the policy loss. This article focuses on adding context to how an unbiased gradient estimator for the SAC policy loss is computed. A basic familiarity with reinforcement learning will help in following along.
Let’s find the gradient estimator for a more general case and then apply it to the SAC policy loss. The best reference that I’ve found for computing an unbiased gradient estimator with the reparameterization trick is . In fact, many of the following equations are from . OK, get ready to dive into some integrals and derivations.
First, let’s define the general objective function that we want to compute an unbiased gradient estimator for:
We would like to compute the gradient of L w.r.t. θ:
Let’s first expand the expected value equation with an integral and apply the gradient:
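For readers who prefer to see the setup written out, here is a sketch of the objective, its gradient, and the expanded form. It assumes z is a scalar, that the objective is the expectation of fθ(z) under qθ(z), and that the gradient can be moved inside the integral; the layout is my own and may differ from other presentations.

```latex
\begin{align*}
% Objective: expectation of f_theta under the distribution q_theta
L(\theta) &= \mathbb{E}_{q_\theta(z)}\left[ f_\theta(z) \right] \\
% Quantity we want to estimate
\nabla_\theta L(\theta) &= \nabla_\theta\, \mathbb{E}_{q_\theta(z)}\left[ f_\theta(z) \right] \\
% Write the expectation as an integral and apply the product rule
 &= \nabla_\theta \int q_\theta(z)\, f_\theta(z)\, dz \\
 &= \underbrace{\int q_\theta(z)\, \nabla_\theta f_\theta(z)\, dz}_{\text{first integral}}
  + \underbrace{\int f_\theta(z)\, \nabla_\theta q_\theta(z)\, dz}_{\text{second integral}}
\end{align*}
```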
The first integral can easily be converted back into an expectation. The second integral, however, contains the gradient of the density, ∇θ qθ(z), which is not itself a probability density, so it cannot be estimated directly by sampling from qθ(z). There are two well-known approaches to deal with this: (1) the score function method (a.k.a. the log-derivative trick or REINFORCE) and (2) reparameterization. Monte Carlo estimates based on the latter typically have lower variance than those of the score function method. As such, SAC uses reparameterization to handle the gradient of the distribution qθ(z). The reparameterization trick expresses a sample from qθ(z) as a deterministic, θ-dependent transformation of noise drawn from a fixed distribution that does not depend on θ.
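To make the variance claim concrete, here is a minimal sketch on a toy problem that is not taken from the SAC paper. It assumes qθ(z) = N(θ, 1) and f(z) = z², so the true gradient of E[z²] = θ² + 1 is 2θ. The function name `grad_estimates` and all parameter values are hypothetical choices for illustration.

```python
import numpy as np


def grad_estimates(theta=1.5, n_samples=100_000, seed=0):
    """Per-sample estimates of d/dtheta E_{z ~ N(theta, 1)}[z**2]."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)  # fixed noise, independent of theta
    z = theta + eps                       # reparameterized sample z = g(eps; theta)

    # Score function (REINFORCE): f(z) * d/dtheta log q_theta(z),
    # where d/dtheta log N(z; theta, 1) = (z - theta).
    score = z**2 * (z - theta)

    # Reparameterization: d/dtheta f(theta + eps) = 2 * (theta + eps).
    reparam = 2.0 * z
    return score, reparam


theta = 1.5
score, reparam = grad_estimates(theta=theta)
true_grad = 2.0 * theta

print("true gradient        :", true_grad)
print("score function  mean :", score.mean(), " variance:", score.var())
print("reparameterized mean :", reparam.mean(), " variance:", reparam.var())
```

Both estimators average to roughly the same value, but on this toy problem the per-sample variance of the reparameterized estimator is much smaller, which is the behavior the paragraph above describes.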
The reparameterization technique pushes everything that depends on θ inside the expectation. We solved one issue, but another one is born: we now need the inverse CDF of qθ(z), which for some distributions does not have a simple analytic expression (see  for why we do not use the inverse CDF).
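One way to write this down, assuming a scalar z, uniform noise ε on (0, 1), and F⁻¹(ε; θ) denoting the inverse CDF of qθ(z) (the symbols ε and F⁻¹ are my notation for this sketch):

```latex
\begin{align*}
% Sample z by pushing fixed uniform noise through the inverse CDF
z &= F^{-1}(\epsilon;\, \theta), \qquad \epsilon \sim \mathcal{U}(0, 1) \\
% The sampling distribution no longer depends on theta,
% so the gradient can move inside the expectation
\nabla_\theta\, \mathbb{E}_{q_\theta(z)}\left[ f_\theta(z) \right]
 &= \mathbb{E}_{\epsilon \sim \mathcal{U}(0,1)}
    \left[ \nabla_\theta\, f_\theta\!\left( F^{-1}(\epsilon;\, \theta) \right) \right]
\end{align*}
```

Here the gradient inside the expectation is a total derivative: it covers both the direct dependence of fθ on θ and the dependence through z.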
We can fix this problem by using implicit differentiation. We first write down the forward CDF and then differentiate it with respect to θ.
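As a sketch of this step, assuming a scalar z and treating the noise ε as fixed while z is a function of θ:

```latex
\begin{align*}
% Forward CDF: the sample z satisfies F(z; theta) = epsilon
F(z;\, \theta) &= \int_{-\infty}^{z} q_\theta(t)\, dt = \epsilon \\
% Total derivative of both sides w.r.t. theta (epsilon does not depend on theta)
\nabla_z F(z;\, \theta)\, \nabla_\theta z + \nabla_\theta F(z;\, \theta) &= 0 \\
% The derivative of the CDF w.r.t. z is the density
\nabla_z F(z;\, \theta) &= q_\theta(z)
\end{align*}
```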
Using the last three equations, we obtain the golden equation that helps us to compute the gradients of the objective.
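Reading the golden equation as ∇θ z = −∇θ F(z; θ) / qθ(z), we can sanity-check it numerically. The sketch below assumes a normal distribution with θ = σ, where the explicit reparameterization z = μ + σε gives ∇σ z = ε. The helper names and the finite-difference step are hypothetical choices for illustration only.

```python
import math


def normal_cdf(z, mu, sigma):
    """CDF F(z; mu, sigma) of a normal distribution."""
    return 0.5 * (1.0 + math.erf((z - mu) / (sigma * math.sqrt(2.0))))


def normal_pdf(z, mu, sigma):
    """Density q(z; mu, sigma) of a normal distribution."""
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def implicit_dz_dsigma(z, mu, sigma, h=1e-5):
    """dz/dsigma = -(dF/dsigma) / q, with dF/dsigma from a central finite difference."""
    dF_dsigma = (normal_cdf(z, mu, sigma + h) - normal_cdf(z, mu, sigma - h)) / (2.0 * h)
    return -dF_dsigma / normal_pdf(z, mu, sigma)


mu, sigma, eps = 0.3, 1.7, 0.8
z = mu + sigma * eps          # explicit reparameterization of the sample

explicit = eps                # dz/dsigma obtained directly from z = mu + sigma * eps
implicit = implicit_dz_dsigma(z, mu, sigma)

print("explicit dz/dsigma:", explicit)
print("implicit dz/dsigma:", implicit)
```

The two values agree up to finite-difference error, without ever inverting the CDF.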
Now we have all the ingredients to compute the gradients of the objective function. Let’s revisit the computation of the second integral and use some calculus techniques as follows:
Let’s review some of the calculus techniques we used (you can also refer to ). From Eq. (2) to Eq. (3), we just rewrite fθ(z) as the integral of its derivative. “We also assume that fθ(z) is sufficiently regular that we can drop the boundary term at infinity”. From Eq. (3) to Eq. (4), we change the order of integration. To better understand how changing the order of integration works in this example, check the inequality I wrote next to Eq. (3). In Eq. (5), we multiply the integrand by 1, written as qθ(z)/qθ(z). You have probably realized by now what we are trying to achieve: we want to transform this integral into one that represents an expected value under qθ(z). Note that, in Eq. (5), we use the definition of F and its gradient to obtain Eq. (6). Finally, we use the golden equation to obtain Eq. (8). Does this look familiar? Yes, you are right: the integral in Eq. (8) represents an expected value under qθ(z).
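The steps described above can be sketched as one chain. It assumes a scalar z and takes the golden equation to be ∇θ z = −∇θ F(z; θ) / qθ(z); the equations are left unnumbered because the grouping of steps here is my own and may not line up exactly with the numbers used in the text.

```latex
\begin{align*}
\int f_\theta(z)\, \nabla_\theta q_\theta(z)\, dz
 % f as the integral of its derivative, with -inf < z' < z < inf
 &= \int \left[ \int_{-\infty}^{z} \nabla_{z'} f_\theta(z')\, dz' \right]
    \nabla_\theta q_\theta(z)\, dz \\
 % swap the order of integration: z' runs over the real line, z from z' to inf
 &= \int \nabla_{z'} f_\theta(z')
    \left[ \int_{z'}^{\infty} \nabla_\theta q_\theta(z)\, dz \right] dz' \\
 % multiply by q/q and use the CDF: int_{z'}^{inf} q dz = 1 - F(z')
 &= \int \frac{q_\theta(z')}{q_\theta(z')}\, \nabla_{z'} f_\theta(z')\,
    \nabla_\theta \left[ 1 - F(z';\, \theta) \right] dz' \\
 &= \int q_\theta(z')\, \nabla_{z'} f_\theta(z')
    \left[ -\frac{\nabla_\theta F(z';\, \theta)}{q_\theta(z')} \right] dz' \\
 % golden equation: the bracket is the implicit gradient of the sample
 &= \int q_\theta(z')\, \nabla_{z'} f_\theta(z')\, \nabla_\theta z'\, dz'
  = \mathbb{E}_{q_\theta(z)}\left[ \nabla_z f_\theta(z)\, \nabla_\theta z \right]
\end{align*}
```

In the first line, the boundary term drops out because the integral of ∇θ qθ(z) over all z is the gradient of 1, which is zero.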
To wrap up, we can write the gradients of our objective function as follows:
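Putting the two integrals together gives the following sketch of the final result, again assuming a scalar z. For comparison, I also write out Eq. 13 of the SAC paper as I read it, under the assumed mapping θ ↔ φ, z ↔ a_t = f_φ(ε_t; s_t), and fθ(z) ↔ log π_φ(a_t|s_t) − Q(s_t, a_t); please check the paper for the exact notation.

```latex
\begin{align*}
% General result: direct dependence on theta plus dependence through the sample z
\nabla_\theta L(\theta)
 &= \mathbb{E}_{q_\theta(z)}\left[ \nabla_\theta f_\theta(z) \right]
  + \mathbb{E}_{q_\theta(z)}\left[ \nabla_z f_\theta(z)\, \nabla_\theta z \right],
 \qquad
 \nabla_\theta z = -\frac{\nabla_\theta F(z;\, \theta)}{q_\theta(z)} \\
% SAC policy gradient estimator (single-sample form), as I read Eq. 13 of the paper
\hat{\nabla}_\phi J_\pi(\phi)
 &= \nabla_\phi \log \pi_\phi(a_t \mid s_t)
  + \left( \nabla_{a_t} \log \pi_\phi(a_t \mid s_t) - \nabla_{a_t} Q(s_t, a_t) \right)
    \nabla_\phi f_\phi(\epsilon_t;\, s_t)
\end{align*}
```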
Note that in Eq. 13 of the SAC paper, the first and second terms correspond to the first and second terms of our equation, respectively. I hope this article helps others better understand the paper.
- OpenAI Spinning Up — Soft Actor-Critic
- Soft Actor-Critic Demystified
- The Generalized Reparameterization Gradient
- Implicit Differentiation
- Leibniz Integral Rule
- Implicit Reparameterization Gradient
- Pathwise Derivatives Beyond the Reparameterization Trick
- Soft Actor Critic — Deep Reinforcement Learning with Real-World Robots