Reinforcement learning can be divided into Model-free and Model-based learning. Model-free learning emphasizes heavily on sampling. It collects a large number of training samples to fit the policy or the value function without the understanding of how things work. Model-based learning, on the contrary, uses or develops the understanding of the model (system dynamics) to compute optimal controls. A model (the system dynamics) describes how things move. Mathematically, the model f predicts the next state when taken an action u from a state x.
In this article, we will look into two commonly used trajectory optimization methods using a known or a learned system dynamics. Linear Quadratic Regulator LQR and iLQR calculate an optimal trajectory from the initial to the target state by optimizing a cost function.
LQR assumes the model is locally linear. iLQR uses an iterative version of LQR to find the optimal trajectory for non-linear systems. These methods are frequently used in many model-based methods. So it does worth some time to understand it. Be warned that the equations are tedious. But once past the fear and with some patience, the principle is really not hard.
LQR assumes the model is locally linear and can be time-varied. This linear model can be expressed in a matrix form as:
We also define a time-varied cost function. We optimize it to find the optimal trajectory that obeys the model. In summary, our objective is mathematically defined as:
Therefore, if the system dynamics f (model) and the cost function c is known, we can plan our optimal controls (actions) using a trajectory optimization method.
Let’s example the states, actions, and models with some simple examples. Our state may compose of the location and velocity in the x and y-direction of an object.
Our actions are the forces applied.
The objective below minimizes the amount of force applied.
But this is not very interesting. We can add another term to penalize the system if it deviates from a planned trajectory.
The following is an example of the possible model.
The following are two different paths taken from different actions.
Our problem statement is:
Linear Quadratic Regulator LQR Overview
Let’s have an overview of the algorithm first. It helps us to get a bigger picture before digging deep into the equations.
LQR composes of
- A backward pass from the target state to the initial state to express the optimal action at each time step in terms of the state. This relation is computed by optimizing the cost function.
- Once the backward pass reaches the initial state, we know x1 and therefore, we can find the first action u1 to be taken.
- A forward pass which we use the model to calculate the next state and use the equation in the backward pass to find the next optimal action.
Backward and forward path
Let’s get a little bit more detail in both paths. In the backward path, we calculate the Q-value function using the
- cost function C,
- the current state & the actions, and
- the value function V for the next state.
We optimize Q, so we can express the optimal action in terms of the state and compute the values for K and k.
As we move backward, we will reach the initial state which x1 (the initial state) is known and therefore, we can finally find u1.
The forward path is very simple. We use the model f to compute the next state. And use the equation computed in the corresponding backward pass to compute the next action.
Here is the big picture. Once we get it, the detail will look much simpler.
The backward pass is the most complicated and can get lost easily. It involves equations not targeted for the human consumption. But we will group that properly to make the points clear and manageable.
Time step T
Let’s start with the last time step in the backward pass.
Our prime objective here is to express the optimal action in terms of the state and compute the value for K and k.
The cost function at time step t for the current actions is:
We are going to break C into four sub-matrices and c into 2 so the equations will look more manageable later.
Cx,x composes of the coefficient in C that will multiply x, k, and x together.
Cu,x composes of the coefficient in C that will multiply u, k, and x together etc…
We generalize the cost function a little bit more by adding a constant and name it Q (a.k.a. Q-value function). We take the derivative to optimize the actions. Now, we can express the optimal action in terms of the current state. And K and k can be computed using the sub-matrices of C.
Here, we reach the first milestone:
Now we can express u in terms of x.
We can plug u it into Q.
Q is now no longer depend on u: the L.H.S. above actually becomes the value function V. Now with some heavy regrouping below, we can express V in terms of x, C, K, and k. This will be needed in the next step.
Time step T-1
Now, we move one more step backward.
We will compute Q again. But there is a major difference here. T-1 is not a terminal state and therefore we need to add the value function for the next state back.
And we just compute V from our last section. Apply the previous result,
and the model
Grouping terms again, we can simplify the equation to:
Optimize Q again:
Now, this is the major milestone of expressing the action in terms of the current state by minimizing the cost.
As we continue moving backward, we reach the initial state x1 which is known. And therefore, we can compute the first action u1.
The forward pass is very simple and just repeat the step below until reaching the target state.
Time step 1
In the backward pass, we compute the K and k when optimizing Q.
Now, using the current state, we can compute the action from
With both current action and state, we find the next state with the model.
Congratulations and thanks for the patience! Here is the final algorithm.
Can we use LQR if the model is stochastic? Actually, we can use LQR as if the model is deterministic.
A deterministic dynamics is defined as:
We can model the stochastic dynamics with a Gaussian distribution:
The good news is that we can still apply LQR with action taken as:
We can ignore Σ as it will cancel others out.
Iterative Linear Quadratic Regulator iLQR for Non-linear models
In LQR, we assume the model is locally linear. iLQR expands LQR to non-linear models. iLQR actually reuses LQR iterative to refine the trajectory. In iLQR, we start with a guessing trajectory and run LQR to refine the guess. We refine the guess iteratively with LQR and eventually, it will converge to the optimal solution.
Newton’s method for optimization
This process is very similar to Newton’s method for optimization. Gradient Descent uses the first-order derivative f’ to approximate a function. Newton’s method for optimization uses the first-order and second-order term to create a quadratic approximation of the original function f.
Starting with an initial guess, we construct a quadratic approximation with the first and second order derivative to approximate f locally. We optimize the quadratic function and use its optimal point as the next guess.
Eventually, the guess will converge to the optimal point of f if it exists.
In LQR, the dynamic model is linear. iLQR expands the concept to non-linear models. We approximate the dynamic model to first order and the cost function to the second order:
Rearrange the terms a little bit,
We can reformulate the equations as:
These equations look almost the same as the equations used in LQR:
So instead of taking x and u as the input parameters, the new functions take δx and δu.
Therefore, we use LQR as an inner loop optimization to find a new trajectory guess that has the lowest cost with the approximated function. Then we use the optimal point of the approximated function as the next guess.
Here is the algorithm:
In short, iLQR is an approximation of Newton’s method for solving:
Differential dynamic programming DDP
The iLQR method does not expand the dynamics to the second order so it is not exactly like Newton’s method. If we actually include the second term, this becomes Newton’s optimization and it is called DDP.
DDP is for the completion of our discussion. For the rest of our model-based RL discussion, we will mainly focus on iLQR.
Iterative LQR with Line Search
However, iLQR may have an overshoot problem just like the Newton’s method. If the original function is highly curved, a second order approximation may not be accurate enough. The optimal point of the approximation may overshoot the real optimal point.
To correct that, we can apply a line search (for example, a binary search) between the original guess and the new guess. We can apply α below with values ranging from 0 to 1 to search for the lowest point to avoid overshoot.
iLQR and LQR form the fundamental tools for model-based reinforcement learning. Combining the knowledge of the model and the cost function, we can plan the optimal actions accordingly. The equations may be tedious but we hope the explanations here will be it easier.
Credits & references
The proof in this article is based on UC Berkely Reinforcement Learning course in the optimal control and planning.
Source: Deep Learning on Medium