# Uncertainty Aware Reinforcement Learning

Original article was published on Artificial Intelligence on Medium

# Uncertainty aware models — solutions that work

a) Learning a function that predicts bad behaviour

Consider a drone learning to fly in a rainforest. We want it to learn to navigate that environment while avoiding collisions with trees. In RL, an agent learns the consequence of an action by trying out that action. So to learn to dodge trees, it must experience a few hits. But high-speed collisions would almost certainly destroy it.

We can train the drone by letting it experience gentle, low-speed hits, so it learns the forest environment. When it encounters a section of the forest absent in the training distribution, it needs knowledge about the uncertainty of its policy to enable safe interaction with that section while collecting new training data. Once confident about that section, it can fly at high speeds in future. This is an example of safe exploration.

To achieve this, we add a cost for hitting trees to the RL cost function c(sₜ, aₜ), giving c(sₜ, aₜ) + C_bad. C_bad is the new cost assigned to actions that result in bad behaviour (a collision). It influences when the drone can fly fast and when it should tread with care.

To estimate C_bad, we use a bad-behaviour-prediction neural network P with weights ϴ. It takes as input the drone's current state sₜ, its observation oₜ, and the sequence of actions [aₜ, aₜ₊₁, …, a_{t+H}] the drone plans to execute, and it estimates the probability of a collision occurring.

The action sequence is selected and optimized by Model Predictive Control (MPC) over a receding time horizon from the current time step t up to t + H. The bad-behaviour model Pϴ outputs the parameter of a Bernoulli distribution over a binary outcome (0 or 1): the probability that a collision occurs within this horizon.

The collision labels are recorded for each horizon H. This means that a label of 1 indicates that bad behaviour occurred somewhere in the sub-sequence between time steps t and t + H. With this probability conditioned on the above inputs, the bad-behaviour model can be simply expressed as:

$$P_\theta\left(\text{collision}_{t:t+H} = 1 \mid s_t,\, o_t,\, a_t, a_{t+1}, \dots, a_{t+H}\right)$$

Similarly, a naive implementation would look like this:
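A minimal sketch of such a predictor, assuming a one-hidden-layer network written in NumPy with untrained random weights. The class name `CollisionPredictor`, the layer sizes, and the input dimensions are illustrative assumptions, not taken from the article.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class CollisionPredictor:
    """Hypothetical P_theta(collision | s_t, o_t, a_t ... a_{t+H})."""

    def __init__(self, state_dim, obs_dim, action_dim, horizon, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        # The planned sequence a_t ... a_{t+H} holds H + 1 actions.
        in_dim = state_dim + obs_dim + action_dim * (horizon + 1)
        self.W1 = rng.normal(0.0, 0.1, size=(in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.1, size=hidden)
        self.b2 = 0.0

    def predict(self, state, obs, actions):
        # Concatenate state, observation and flattened action sequence.
        x = np.concatenate([state, obs, np.ravel(actions)])
        h = np.tanh(x @ self.W1 + self.b1)
        # Sigmoid output is the Bernoulli parameter: probability of a collision.
        return float(sigmoid(h @ self.W2 + self.b2))

model = CollisionPredictor(state_dim=4, obs_dim=8, action_dim=2, horizon=5)
p = model.predict(np.zeros(4), np.zeros(8), np.zeros((6, 2)))
```

In a real system the weights would be trained with a binary cross-entropy loss on the recorded collision labels; that training loop is omitted here.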

However, you might have noticed that Pϴ outputs the probability of bad behaviour, not the actual cost of that behaviour. So the bad-behaviour cost is multiplied by this probability p to give pC_bad. Finally, we weight it with a scalar λ, which determines how important it is for the agent to avoid risky outcomes compared to achieving its goal, so the total cost becomes c(sₜ, aₜ) + λpC_bad.
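As a small worked example of this weighting, the sketch below combines a task cost with the scaled collision cost. The function name `risk_aware_cost` and all the numbers are made up for illustration.

```python
def risk_aware_cost(task_cost, p_collision, c_bad, lam):
    """Task cost c(s_t, a_t) plus the weighted expected collision cost lam * p * C_bad."""
    return task_cost + lam * p_collision * c_bad

# Same plan, same predicted collision probability, two risk weightings.
relaxed = risk_aware_cost(task_cost=1.0, p_collision=0.2, c_bad=10.0, lam=0.5)
cautious = risk_aware_cost(task_cost=1.0, p_collision=0.2, c_bad=10.0, lam=2.0)
```

A larger λ makes the same risky plan look more expensive, so the planner prefers slower, safer action sequences.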

Note that while we want a function that predicts unsafe actions, a purely discriminative model, which takes an input and returns a safety estimate, might not always make us happy: its predictions can be quite meaningless in unfamiliar states. It is therefore beneficial to incorporate model uncertainty in its predictions.

b) Bayesian Neural Networks

A neural network can be viewed as a conditional model P(y|x, w) which, given an input x, assigns a probability to each possible output y using weights w. In a Bayesian neural network, instead of a single fixed value for each weight, the weights are represented as probability distributions over their possible values.

How does this work?

Using a set of training samples D, we find a posterior over the model weights conditioned on these samples, P(w|D). To predict the distribution of a particular label ŷ, each viable combination of the weights makes a prediction on the same input x, and these predictions are averaged, weighted by the posterior distribution.
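Written out, this posterior-weighted average is the predictive distribution below. The Monte Carlo approximation on the right assumes N weight samples wᵢ drawn from the posterior; N and wᵢ are notation introduced here for illustration.

$$P(\hat{y} \mid x, D) = \int P(\hat{y} \mid x, w)\, P(w \mid D)\, dw \;\approx\; \frac{1}{N} \sum_{i=1}^{N} P(\hat{y} \mid x, w_i), \qquad w_i \sim P(w \mid D)$$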

If the network is uncertain about an observation, this shows up in the output, because weights with higher uncertainty introduce more variability in the prediction. This is common in regions where the model has seen little or no data, and it encourages exploration there. As more observations are made, the model's decisions become more deterministic.

The posterior distribution over the weights, P(w|D), is generally intractable to compute exactly, so it is approximated. This is done by finding the parameters ϴ of a different distribution over the weights, q(w|ϴ), that make it as close as possible to the true posterior P(w|D). This is variational inference; a little beyond our current scope :).

c) Dropout

Dropout in RL is a bad idea. This isn’t the risk we take here, though. Remember we said a discriminative model will not always make us happy unless it can incorporate uncertainty in bad-behaviour predictions? Dropout is a simple way to do that.

Dropout is a regularization technique that randomly drops each unit in a neural network with probability p, or retains it with probability 1 − p. It's frequently used during training to prevent neurons from over-depending on each other. This creates a new but related neural network at each training iteration.

In practice, dropout is usually applied only during training and removed at test time to achieve high test accuracy. However, by retaining dropout at test time, we can estimate uncertainty by computing the sample mean and variance of the outputs of several stochastic forward passes on the same input. It's a simple approach to estimating uncertainty.
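A minimal sketch of test-time dropout, assuming a tiny two-layer ReLU network in NumPy with random, untrained weights. The function name `mc_dropout_predict`, the layer sizes, and the number of passes are illustrative assumptions.

```python
import numpy as np

def mc_dropout_predict(x, W1, W2, drop_prob=0.5, n_passes=200, seed=0):
    """Run several stochastic forward passes with dropout kept on."""
    rng = np.random.default_rng(seed)
    keep = 1.0 - drop_prob
    outputs = []
    for _ in range(n_passes):
        h = np.maximum(0.0, x @ W1)            # hidden ReLU layer
        mask = rng.random(h.shape) < keep      # keep each unit with probability 1 - p
        h = h * mask / keep                    # inverted-dropout rescaling
        outputs.append(h @ W2)
    outputs = np.array(outputs)
    # Sample mean is the prediction, sample variance the uncertainty estimate.
    return outputs.mean(axis=0), outputs.var(axis=0)

rng = np.random.default_rng(1)
W1 = rng.normal(size=(3, 16))
W2 = rng.normal(size=(16, 1))
x = np.array([0.5, -1.0, 2.0])
mean, var = mc_dropout_predict(x, W1, W2)
```

With `drop_prob=0.0` every pass is identical and the variance collapses to zero, which is the usual deterministic test-time behaviour.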

Its caveat is that dropout, viewed as a variational inference method, can severely underestimate uncertainty because of the variational lower bound it optimizes.

What does that mean?

To understand this, we need to introduce KL Divergence — a measure of the difference between two probability distributions over the same random variable.

At times, computing the true probability distribution over large real-valued spaces is expensive. So an approximation to that distribution is used instead, and the KL divergence (difference) between the two is minimized.

As an illustration, let q(x) be an approximation to the precise distribution p(x). This approximation aims to place a high chance of occurrence where p(x) has a high probability. Suppose q(x) is a single Gaussian, while p(x) is a mixture of two Gaussians. To place high probability wherever p(x) is high, q(x) spreads itself across the two Gaussians in p, placing probability mass on both equally.

Similarly, with dropout there is a true posterior p(w|x, y) over the model's weights w, conditioned on the inputs x and the labels y, and q(w) is used as an approximating distribution for this posterior. We then minimize the KL divergence KL(q ‖ p) between q(w) and the actual posterior p(w|x, y) to make them as close as possible. However, doing so penalises q(w) for placing probability mass where p(w|x, y) has none, but barely penalises q(w) for failing to place probability mass where p(w|x, y) actually has high probability. This is what underestimates the model's uncertainty.
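The asymmetry can be checked numerically. The sketch below assumes a three-bin discrete stand-in for the distributions; the arrays `p`, `q_one_mode` and `q_spread` are made-up examples, with `p` having two modes separated by a low-probability bin.

```python
import numpy as np

def kl(a, b):
    """KL(a || b) for discrete distributions on the same support."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    nz = a > 0                      # terms with a = 0 contribute nothing
    return float(np.sum(a[nz] * np.log(a[nz] / b[nz])))

p = np.array([0.495, 0.01, 0.495])          # "true" two-mode distribution
q_one_mode = np.array([0.98, 0.01, 0.01])   # covers only the first mode
q_spread = np.array([1 / 3, 1 / 3, 1 / 3])  # spreads mass over everything

reverse_one, reverse_spread = kl(q_one_mode, p), kl(q_spread, p)
forward_one, forward_spread = kl(p, q_one_mode), kl(p, q_spread)
```

Here `reverse_one` is smaller than `reverse_spread`: under KL(q ‖ p), the direction minimized in variational inference, the approximation that ignores a whole mode scores better than the one that spreads out. Under KL(p ‖ q) the ranking flips. An approximation that ignores a mode is narrower than the true distribution, which is the sense in which uncertainty gets underestimated.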

d) Bootstrap Ensembles

Multiple independent models are trained, and their predictions are averaged. If the models produce nearly the same output, they agree, which indicates certainty in their predictions; if their outputs diverge, the prediction is uncertain.

To make the models independent of each other, each model's weights ϴᵢ are trained on a dataset sampled with replacement from the training set. In practice, however, random initialisation of the weights ϴᵢ and stochastic gradient descent during training are often enough to make the models sufficiently independent.
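A minimal sketch of a bootstrap ensemble, assuming cubic polynomial regressors in place of neural networks so that it stays short. The function names, the toy sine data, and the ensemble size are illustrative assumptions.

```python
import numpy as np

def fit_bootstrap_ensemble(X, y, n_models=20, degree=3, seed=0):
    """Fit each model on a resample drawn with replacement from (X, y)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap indices
        models.append(np.polyfit(X[idx], y[idx], degree))
    return models

def ensemble_predict(models, x):
    """Mean prediction and disagreement (standard deviation) across models."""
    preds = np.array([np.polyval(coeffs, x) for coeffs in models])
    return preds.mean(axis=0), preds.std(axis=0)

rng = np.random.default_rng(42)
X = rng.uniform(-1.0, 1.0, size=50)                  # training inputs in [-1, 1]
y = np.sin(3.0 * X) + 0.1 * rng.normal(size=50)
models = fit_bootstrap_ensemble(X, y)

_, std_seen = ensemble_predict(models, np.array([0.0]))    # inside the data
_, std_unseen = ensemble_predict(models, np.array([3.0]))  # far outside it
```

The models agree closely at `x = 0.0`, where they all saw data, and disagree much more at `x = 3.0`, so the spread of their predictions acts as the uncertainty signal.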

Dropout, as a measure of uncertainty, can be seen as a cheap approximation to an ensemble method, where each sampled dropout mask acts as a different model. The Bayesian neural network embodies the ensemble concept too: by taking an expectation under the posterior distribution over the weights P(w|D), it becomes equivalent to an ensemble of infinitely many models, and more models means better estimates.

e) Curious iLQR

Think of curiosity as an inspiration to solve for uncertainties in the agent’s environment. Let’s see how we can add curious behaviour to an agent’s control loop.

Some LQR background

In RL, a Linear Quadratic Regulator (LQR) outputs a linear controller which is used to exploit the model. When working with non-linear dynamics, we fit the model p(sₜ₊₁ | sₜ, aₜ) locally at each time step using linear regression. This iterative control process is called iterative LQR (iLQR), a variant of differential dynamic programming (DDP).

The system dynamics are represented by the equation:

$$x_{t+1} = x_t + f(x_t, u_t)\,\Delta t$$

Here f represents the learned dynamics model, while xₜ₊₁ is the state at the next time step, expressed as the current state xₜ plus the model's predicted change to that state when action uₜ is taken. For instance, if the state is a robot's velocity, xₜ would be the current velocity, while f(xₜ, uₜ)Δt would be the predicted change when uₜ is selected, resulting in a new velocity xₜ₊₁.
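The velocity example can be sketched in a few lines: the next state is the current state plus f(x, u)Δt. The dynamics function below (thrust minus a small drag term) is a hypothetical stand-in for a learned model, and the numbers are made up.

```python
import numpy as np

def step(x, u, f, dt):
    """One step of the dynamics: next state = current state + f(x, u) * dt."""
    return x + f(x, u) * dt

def f(x, u):
    # Hypothetical dynamics: acceleration is commanded thrust minus drag.
    return u - 0.1 * x

x = np.array([2.0])                                  # current velocity
x_next = step(x, np.array([1.0]), f, dt=0.5)         # 2.0 + (1.0 - 0.2) * 0.5
```

Here the predicted change is 0.4, so the new velocity is 2.4.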

Making it Curious

For the integration of uncertainty in the above system dynamics, it’s written as a Gaussian distribution, represented by a mean vector, μ, and a covariance matrix, Σ.

A Gaussian dynamics model can use a neural network that maps a state and action pair to the mean change in state. This change in state is the mean vector μ of the model f.

We can implement the system dynamics as a Gaussian Process (GP) by drawing the model f from a normal distribution, where we attempt to learn the best mean vector μ that minimizes the cost function. The GP then delivers predictions using the equation:

$$x_{t+1} \sim \mathcal{N}\left(x_t + f(x_t, u_t)\,\Delta t,\; \Sigma_{t+1}\right)$$

where f(xₜ, uₜ) is the mean predicted by the trainable dynamics function, and Σₜ₊₁ is the covariance matrix of the GP predictions at the current state and action.

This GP has the same form as the ordinary, non-curious LQR stochastic dynamics equations. What's different? In non-curious iLQR, we would ignore the covariance Σₜ₊₁ owing to the symmetry of Gaussians. However, curious iLQR needs the covariance of the predictive distribution to assess the model uncertainty. High model uncertainty means high variance. Σₜ₊₁ represents the model's uncertainty about the prediction xₜ₊₁ at the current state and action (xₜ, uₜ).
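To see how a GP's predictive variance tracks model uncertainty, here is a minimal sketch of standard GP regression with an RBF kernel in NumPy. It assumes a one-dimensional input as a stand-in for the state-action pair; the kernel length scale, noise level, and toy data are illustrative assumptions.

```python
import numpy as np

def rbf(a, b, length=0.5):
    """Squared-exponential kernel between two 1-D input arrays."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_predict(x_train, y_train, x_test, noise=1e-2):
    """Posterior mean and variance of a zero-mean GP at the test inputs."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf(x_train, x_test)
    K_ss = rbf(x_test, x_test)
    mean = K_s.T @ np.linalg.solve(K, y_train)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, np.diag(cov)

x_train = np.linspace(-1.0, 1.0, 10)        # inputs the model has seen
y_train = np.sin(3.0 * x_train)
mean, var = gp_predict(x_train, y_train, np.array([0.0, 4.0]))
```

The variance at `0.0`, inside the training data, is small, while at `4.0`, far from any data, it returns to roughly the prior variance of 1. That gap is the signal a curious controller can use.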

This uncertainty from the GP model is then used to encourage the agent to take actions that resolve the model's future uncertainty about such states. In short, the agent is encouraged to select actions that will reduce the model's variance. This is done by rewarding the agent for actions whose outcomes carry some degree of uncertainty, while still maximizing the goal-specific reward.
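The trade-off can be illustrated with a simple additive bonus. This is not the derivation used by curious iLQR, only a toy sketch of the idea; the function name `curious_score`, the weight `beta`, and the numbers are all assumptions.

```python
import numpy as np

def curious_score(task_reward, predictive_variance, beta=0.5):
    """Goal-specific reward plus a bonus for visiting uncertain outcomes."""
    return task_reward + beta * predictive_variance

# Three candidate actions: task reward and model variance for each.
task_reward = np.array([1.0, 0.9, 0.2])
variance = np.array([0.01, 0.50, 0.05])

best_plain = int(np.argmax(task_reward))
best_curious = int(np.argmax(curious_score(task_reward, variance)))
```

Without the bonus the agent picks action 0. With it, action 1 wins: it gives up a little task reward to visit an outcome the model is unsure about.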

Understanding LQR optimization in model-based RL can take a bit of juggling but is essential for grasping how the curiosity algorithm is derived. Let's not scare ourselves with those equations now though.

Rewarding curious actions can enable the agent to reach its goal faster than with standard iLQR. It helps prevent the model from getting stuck in local optima, so better solutions are found in less time.