Tricks for Manipulating Probability

4. Log-Derivative Trick:

The first approach is to differentiate the density q(z) directly; this is also called the Score Function Estimator.

What is a ‘score’ in statistics?
The score (or informant) is the gradient of the log-likelihood function log(L(θ)) w.r.t. the parameter vector θ (see wiki).
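
In symbols, the score of a likelihood L(θ) is

score(θ) = ∇_θ log L(θ).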

So, in our case, the trick is applied to the density q(z; ϕ). Here, ϕ is the parameter, q plays the role of the likelihood function, and the gradient is taken w.r.t. ϕ.
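
In this notation, the identity reads

∇_ϕ q(z; ϕ) = q(z; ϕ) ∇_ϕ log q(z; ϕ),   i.e.   ∇_ϕ log q(z; ϕ) = ∇_ϕ q(z; ϕ) / q(z; ϕ).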

Properties of the score function:
1. The expectation of the score function is 0. (Easy to prove; a one-line sketch is given after this list. Hint: swap the gradient and the integral.)

2. The covariance of the score function is the Fisher information matrix. (Just a useful property)
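
A one-line sketch of the first property, assuming the gradient and the integral can be interchanged:

E_q[∇_ϕ log q(z; ϕ)] = ∫ q(z; ϕ) ∇_ϕ log q(z; ϕ) dz = ∫ ∇_ϕ q(z; ϕ) dz = ∇_ϕ ∫ q(z; ϕ) dz = ∇_ϕ 1 = 0

For the second property, the Fisher information matrix is F(ϕ) = E_q[∇_ϕ log q(z; ϕ) ∇_ϕ log q(z; ϕ)^T]; since the score has mean 0, this is exactly its covariance.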

Let us find out how the above trick is used to solve the central problem of stochastic optimisation: estimating the gradient of an expectation, ∇_ϕ E_q[f(z)], where q = q(z; ϕ).

The derivation, written out after this list, proceeds in four steps:
1. The gradient is w.r.t. ϕ, and the order of integration and differentiation can be interchanged.
2. The gradient of q(z) is taken, and the identity trick is applied by multiplying and dividing by q(z).
3. The log-derivative trick comes into play.
4. The gradient of the expectation is thereby transformed into an expectation of f(z) times the score.
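
Written out, the four steps give the score-function estimator:

∇_ϕ E_q[f(z)] = ∇_ϕ ∫ q(z; ϕ) f(z) dz
             = ∫ ∇_ϕ q(z; ϕ) f(z) dz
             = ∫ q(z; ϕ) (∇_ϕ q(z; ϕ) / q(z; ϕ)) f(z) dz
             = ∫ q(z; ϕ) ∇_ϕ log q(z; ϕ) f(z) dz
             = E_q[f(z) ∇_ϕ log q(z; ϕ)]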

Note that we can subtract a constant c from f(z) without changing the result, thanks to the property “the expectation of the score function is 0”. A well-chosen constant c, however, reduces the variance of the gradient estimates.
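
Concretely, for any constant c (taking c = E_q[f(z)] is a common choice),

E_q[(f(z) − c) ∇_ϕ log q(z; ϕ)] = E_q[f(z) ∇_ϕ log q(z; ϕ)] − c E_q[∇_ϕ log q(z; ϕ)] = E_q[f(z) ∇_ϕ log q(z; ϕ)],

so the estimator remains unbiased while its variance can be reduced.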

Now, we are ready to derive the Vanilla Policy Gradient in Reinforcement Learning.

Let us look at some definitions used in the derivation:

  • A trajectory τ is a sequence of states and actions in the world. Here, s_t and a_t denote the state and action at time t, respectively.

  • The goal of the agent is to maximise some notion of cumulative reward over a trajectory, known as the return.
  • One kind of return is the infinite-horizon discounted return, which is the sum of all rewards ever obtained by the agent, discounted by how far off in the future they are obtained. This formulation includes a discount factor γ ∈ (0, 1).
  • Another kind is the finite-horizon undiscounted return, which is just the sum of rewards obtained in a fixed window of steps. We will use this one in our derivation.
  • A policy π is a rule used by an agent to decide what actions to take. It may be stochastic, in which case π(·|s) is a probability distribution over actions in a given state s. We often denote the parameters of such a policy by θ.
  • Let us suppose that both the environment transitions P and the policy π are stochastic. In this case, the probability of a T-step trajectory under the policy π can be written down explicitly (see the equations after this list).
  • The log-derivative trick, applied to this trajectory distribution, rewrites its gradient in terms of its log-probability.
  • The log-probability of a trajectory is just the sum of the log-probabilities of the initial state, the transitions, and the chosen actions.
  • The gradient of the log-probability of a trajectory follows by differentiating this sum term by term.
  • Note that the environment has no dependence on θ, so the gradients of the initial-state probability ρ_0 and of the transition probabilities P are zero; only the policy terms survive.
  • The expected return J(π) is the expectation of the return over trajectories sampled by acting according to π.
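
In symbols, following the standard Spinning Up RL notation (with r_t the reward at step t and ρ_0 the initial-state distribution):

  • Trajectory: τ = (s_0, a_0, s_1, a_1, …)
  • Infinite-horizon discounted return: R(τ) = Σ_{t=0}^{∞} γ^t r_t
  • Finite-horizon undiscounted return (used below): R(τ) = Σ_{t=0}^{T−1} r_t
  • Stochastic policy: a_t ∼ π_θ(·|s_t)
  • Probability of a T-step trajectory: P(τ|π) = ρ_0(s_0) Π_{t=0}^{T−1} P(s_{t+1}|s_t, a_t) π_θ(a_t|s_t)
  • Log-derivative trick for the trajectory distribution: ∇_θ P(τ|θ) = P(τ|θ) ∇_θ log P(τ|θ)
  • Log-probability of a trajectory: log P(τ|π) = log ρ_0(s_0) + Σ_{t=0}^{T−1} [log P(s_{t+1}|s_t, a_t) + log π_θ(a_t|s_t)]
  • Gradient of the log-probability: ∇_θ log P(τ|π) = Σ_{t=0}^{T−1} ∇_θ log π_θ(a_t|s_t)
  • Expected return: J(π) = E_{τ∼π}[R(τ)]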

The goal in RL is to select a policy which maximises the expected return J(π) when the agent acts according to it. The central optimisation problem in RL is therefore to find the optimal policy, denoted π*.
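
In symbols,

π* = arg max_π J(π)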

We would like to optimise the policy by gradient ascent on its parameters θ, as shown below.
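
A standard update rule, with step size α and iteration index k, is

θ_{k+1} = θ_k + α ∇_θ J(π_θ) |_{θ_k}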

The gradient of the policy performance, ∇J(π), is called the policy gradient. The algorithms that optimise the policy in this manner are called policy gradient algorithms. The key idea underlying any policy gradient is to push up the probabilities of actions that lead to a higher return, and push down the probabilities of actions that lead to a lower return, until you arrive at the optimal policy.

The main steps of the derivation are given below (following the presentation in Spinning Up RL). I highly recommend this resource to all RL practitioners. Link here.
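
In the notation above, the derivation runs:

∇_θ J(π_θ) = ∇_θ E_{τ∼π_θ}[R(τ)]
           = ∇_θ ∫ P(τ|θ) R(τ) dτ                              (expand the expectation)
           = ∫ ∇_θ P(τ|θ) R(τ) dτ                              (bring the gradient inside the integral)
           = ∫ P(τ|θ) ∇_θ log P(τ|θ) R(τ) dτ                   (log-derivative trick)
           = E_{τ∼π_θ}[∇_θ log P(τ|θ) R(τ)]                    (return to expectation form)
           = E_{τ∼π_θ}[Σ_{t=0}^{T−1} ∇_θ log π_θ(a_t|s_t) R(τ)]   (gradient of the trajectory log-probability)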

This is an expectation, which means that we can estimate it with a sample mean. If we collect a set of trajectories D = {τ} by letting the agent act in the environment using the policy π_θ, the policy gradient can be estimated as shown below.
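
In symbols, with |D| the number of collected trajectories,

ĝ = (1/|D|) Σ_{τ∈D} Σ_{t=0}^{T−1} ∇_θ log π_θ(a_t|s_t) R(τ)

A minimal numpy sketch of this sample-mean estimate, assuming the per-step gradients ∇_θ log π_θ(a_t|s_t) and the returns R(τ) have already been computed (e.g. with an autodiff library); policy_gradient_estimate is a hypothetical helper name, not part of any library:

import numpy as np

def policy_gradient_estimate(grad_log_probs, returns):
    # grad_log_probs: list over trajectories; entry i is an array of shape
    #                 (T_i, n_params) holding the per-step gradients of
    #                 log pi_theta(a_t | s_t) for trajectory i.
    # returns: list of scalar returns R(tau), one per trajectory.
    per_trajectory = [g.sum(axis=0) * R for g, R in zip(grad_log_probs, returns)]
    # Sample mean over the collected set of trajectories D.
    return np.mean(per_trajectory, axis=0)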

This brings us to our last trick,