Trust Region Policy Optimisation (TRPO) — a policy-based reinforcement learning algorithm

The original article was published by Dhanoop Karunakaran in Artificial Intelligence on Medium.

Many policy gradient approaches cannot guarantee monotonic improvement of performance, because policy updates can change the policy rapidly and the advantage function is a noisy estimate. TRPO, on the other hand, is based on trust-region optimisation, which guarantees monotonic improvement by adding a trust-region constraint that limits how far the new policy is allowed to move from the old policy [2].

Trust region optimisation strategy. Source: [4]

In trust-region optimisation, we first decide the step size, α, and construct a region around the current point by taking α as the radius of a circle. We call this region a trust region. The search for the best point (the local minimum or local maximum) is bounded within that region. Once we have the best point, it determines the direction of the next step. This process repeats until the optimal point is reached. The constraint in TRPO plays the role of the step size α.
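The loop described above can be sketched on a toy 1-D problem. This is only an illustration of the trust-region search idea, not the TRPO algorithm itself; the function `f` and all names are made up for the example.

```python
import numpy as np

def f(x):
    # Toy objective with its minimum at x = 3
    return (x - 3.0) ** 2

def trust_region_search(x0, alpha=0.5, iters=50):
    """Repeatedly search for the best point inside a region of
    radius alpha around the current point, then move there."""
    x = x0
    for _ in range(iters):
        # Candidate points bounded by the trust region [x - alpha, x + alpha]
        candidates = np.linspace(x - alpha, x + alpha, 101)
        # Best point in the region determines the next step
        x = candidates[np.argmin(f(candidates))]
    return x

x_star = trust_region_search(x0=0.0)  # converges towards the minimum at 3.0
```

Note how the radius α bounds each update: the search never trusts the model of the objective beyond that region, which is exactly the intuition TRPO applies to policy updates.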

KL divergence. Source: [3]

The constraint is based on the KL divergence, which measures how far the new probability distribution is from the old one [2].

Please refer to this link to learn more about the trust-region optimisation strategy.

Let’s dive into TRPO

The goal is to optimise the objective function η(π): the expected cumulative discounted reward, η(π) = E_{τ∼π}[Σ_{t=0}^∞ γ^t r(s_t)].

In policy gradient methods, the policy is modified explicitly to reach the optimal policy. Based on [6], we can write the performance of a new policy π̃ in terms of its advantage over the original policy π: η(π̃) = η(π) + E_{τ∼π̃}[Σ_t γ^t A_π(s_t, a_t)].

In the TRPO paper [7], the above equation is rewritten as a sum over states, weighted by the discounted state-visitation frequencies ρ:

η(π̃) = η(π) + Σ_s ρ_π̃(s) Σ_a π̃(a|s) A_π(s, a)

Source: [5][7][8]

If the advantage term is greater than zero, the new policy is guaranteed to improve on the current policy. However, the state-visitation frequencies under the new policy are difficult to estimate, which makes it hard to use the above equation directly [5]. The TRPO paper [7] suggests replacing η with a local approximation, L, in which the state visitation under the new policy is replaced by that of the old policy (which we already know). The equation is shown below.

L_π(π̃) = η(π) + Σ_s ρ_π(s) Σ_a π̃(a|s) A_π(s, a)

Source: [5][7][8]

If the approximation is accurate within the trust region, monotonic improvement can be guaranteed. We therefore need to make sure the update to the local approximation stays within the trust-region constraint.
The TRPO paper [7] shows that the policy improves monotonically if we maximise the equation below:

maximise_π̃ [ L_π(π̃) − C · D_KL^max(π, π̃) ]

Source: [5][7][8]

We can rewrite this equation in terms of the parameterised policy, θ:

maximise_θ [ L_θold(θ) − C · D_KL^max(θold, θ) ]

Source: [5][7][8]

Here C is the penalty coefficient that determines the step size [7]. In practice, if we use the penalty coefficient suggested by the theory, the step size will be very small [7]. We can take larger steps by instead placing a constraint on the KL divergence between the new policy and the old policy. This gives a trust region in which the update to the local approximation guarantees monotonic improvement of the policy. We can rewrite the above equation as below:

maximise_θ L_θold(θ)  subject to  D_KL^max(θold, θ) ≤ δ

Source: [5][7][8]

As shown below, the max KL divergence is replaced by the mean KL divergence: constraining the maximum KL at every state is impractical, whereas the mean KL over states can be estimated easily from samples.

maximise_θ L_θold(θ)  subject to  D̄_KL^ρθold(θold, θ) ≤ δ

Source: [5][7][8]

We can convert the above constrained optimisation problem into a sample-based estimate, as shown below:

maximise_θ E_{s∼ρθold, a∼πθold} [ (π_θ(a|s) / π_θold(a|s)) · A_θold(s, a) ]  subject to  E_{s∼ρθold} [ D_KL(π_θold(·|s) ‖ π_θ(·|s)) ] ≤ δ

Source: [5][7][8]
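The sample-based estimate can be sketched in a few lines. This is a minimal illustration of the surrogate objective and the mean-KL constraint, assuming we already have a batch of sampled action probabilities and advantage estimates (all numbers below are made up; in practice the probabilities come from the policy networks and the advantages from a value-function estimator):

```python
import numpy as np

def surrogate_objective(new_probs, old_probs, advantages):
    """Sample estimate of E[ pi_theta(a|s) / pi_theta_old(a|s) * A(s, a) ]."""
    ratios = new_probs / old_probs  # importance sampling ratios
    return float(np.mean(ratios * advantages))

def mean_kl(old_dists, new_dists):
    """Sample estimate of E_s[ D_KL(pi_old(.|s) || pi_new(.|s)) ].
    Inputs have shape (n_samples, n_actions): full action distributions."""
    per_state_kl = np.sum(old_dists * np.log(old_dists / new_dists), axis=1)
    return float(np.mean(per_state_kl))

# Toy batch of three sampled (state, action) pairs
old_probs = np.array([0.5, 0.25, 0.4])   # pi_theta_old(a|s) of the actions taken
new_probs = np.array([0.6, 0.20, 0.5])   # pi_theta(a|s) of the same actions
advantages = np.array([1.0, -0.5, 0.8])  # advantage estimates

L = surrogate_objective(new_probs, old_probs, advantages)

# Mean-KL constraint over the full action distributions at the sampled states
old_dists = np.array([[0.5, 0.5], [0.25, 0.75], [0.4, 0.6]])
new_dists = np.array([[0.6, 0.4], [0.20, 0.80], [0.5, 0.5]])
delta = 0.01
constraint_ok = mean_kl(old_dists, new_dists) <= delta
```

TRPO then maximises `L` subject to `mean_kl(...) <= delta`; the paper solves this with a conjugate-gradient step followed by a line search, which is beyond the scope of this sketch.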

If you like my write-up, follow me on GitHub, LinkedIn, and/or Medium.