# Trust Region Policy Optimisation (TRPO) — a policy-based Reinforcement Learning algorithm

Original article was published by Dhanoop Karunakaran on Artificial Intelligence on Medium


Many policy gradient approaches cannot guarantee monotonic improvement of performance, because policy updates can change the policy too rapidly and the advantage function is a noisy estimate. TRPO, on the other hand, is based on trust-region optimisation, which guarantees monotonic improvement by adding a trust-region constraint that limits how far the new policy is allowed to move from the old policy [2].

In the trust-region method, we first decide the step size, α, and construct a region by treating α as the radius of a circle around the current point; we call this region the trust region. The search for the best point (the local minimum or local maximum) is bounded to that region. Once we have the best point, it determines the direction of the update. This process repeats until the optimal point is reached. The constraint in TRPO bounds the step size α.
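The iterative procedure above can be sketched numerically. The following is a minimal 1-D illustration of the trust-region idea, not TRPO itself: the objective `f`, the radius `alpha`, and the grid search over candidates are all illustrative assumptions.

```python
import numpy as np

# Hypothetical 1-D objective we want to maximise (peak at x = 3).
def f(x):
    return -(x - 3.0) ** 2

def trust_region_ascent(x0, alpha=0.5, iters=20):
    """At each step, search only inside the radius-alpha trust region
    around the current point and move to the best candidate found there."""
    x = x0
    for _ in range(iters):
        # Candidate points inside the trust region [x - alpha, x + alpha].
        candidates = np.linspace(x - alpha, x + alpha, 101)
        x = candidates[np.argmax(f(candidates))]
    return x

print(trust_region_ascent(0.0))  # converges near the maximiser x = 3
```

Because each update is confined to the region where the local model is trusted, the iterate never overshoots by more than α per step — the same intuition behind bounding TRPO's policy update.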

The constraint is based on the KL divergence, which measures how far the new probability distribution is from the old one [2] (it is not symmetric, so it is not a true distance metric).
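To make the KL divergence concrete, here is a small sketch for discrete action distributions; the two example policies are made-up numbers, not anything from the paper.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

old_policy = [0.7, 0.2, 0.1]    # action probabilities under the old policy
new_policy = [0.6, 0.25, 0.15]  # action probabilities under the new policy

print(kl_divergence(old_policy, new_policy))  # small positive number
print(kl_divergence(old_policy, old_policy))  # 0.0 for identical distributions
```

TRPO constrains this quantity (averaged over states) to stay below a threshold δ, so the new policy's action distribution cannot drift far from the old one in a single update.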

Let’s dive into TRPO

The goal is to optimise the objective function η(π): maximise the expected cumulative discounted reward.
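The equation referenced here appears to have been an image in the original article; reconstructed from the TRPO paper [7], the objective is:

```latex
\eta(\pi) = \mathbb{E}_{s_0, a_0, \ldots}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t)\right],
\qquad s_0 \sim \rho_0,\quad a_t \sim \pi(a_t \mid s_t),\quad s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)
```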

In policy gradient methods, the policy is modified explicitly to reach the optimal policy. Based on [6], we can write the performance of an updated policy in terms of its advantage over the original policy.
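The identity from Kakade and Langford [6], as restated in the TRPO paper (the original article showed it as an image), expresses the performance of a new policy π̃ via the advantage of the old policy π:

```latex
\eta(\tilde{\pi}) = \eta(\pi)
  + \mathbb{E}_{s_0, a_0, \ldots \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^{t} A_{\pi}(s_t, a_t)\right],
\qquad A_{\pi}(s, a) = Q_{\pi}(s, a) - V_{\pi}(s)
```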

In the TRPO paper [7], the above equation is rewritten as follows:
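Reconstructed from [7], the rewritten form replaces the expectation over trajectories with a sum over states weighted by the discounted state-visitation frequencies ρ of the new policy:

```latex
\eta(\tilde{\pi}) = \eta(\pi)
  + \sum_{s} \rho_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a \mid s)\, A_{\pi}(s, a),
\qquad \rho_{\pi}(s) = \sum_{t=0}^{\infty} \gamma^{t}\, P(s_t = s)
```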

If the advantage term is greater than zero, we can guarantee that the new policy improves on the current policy. However, the state-visitation frequencies under the new policy are difficult to estimate, which makes it hard to use the above equation directly [5]. The TRPO paper [7] therefore suggests replacing η with a local approximation, L, in which the state-visitation frequencies of the new policy are replaced by those of the old policy (which we already know). The equation is shown below.
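Reconstructed from [7], the local approximation differs from η only in using the old policy's visitation frequencies ρ_π instead of ρ_π̃:

```latex
L_{\pi}(\tilde{\pi}) = \eta(\pi)
  + \sum_{s} \rho_{\pi}(s) \sum_{a} \tilde{\pi}(a \mid s)\, A_{\pi}(s, a)
```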

If the approximation is accurate within the trust region, monotonic improvement can be guaranteed. We therefore need to make sure the local approximation is optimised only within the trust-region constraint.