Bayesian Neural Networks: 3 Bayesian CNN



In the deviant manner that comes naturally to the author, a strange new dataset has been created to highlight the problem. We find ourselves helping parents solve an important problem. Parents are always interested in measuring the height of their children. Well, no longer do they need to fret about carrying a ruler around with them all day. We’ll create a model that estimates height from a picture. A dataset has been created consisting of 836 silhouettes of babies and toddlers together with their heights. Rather than a classification problem, we’re therefore solving a regression problem: we aim to return a single floating-point value corresponding to the height of the child silhouetted in the photograph. It’s a slightly harder problem than the classification exercise from the last chapter and made even more so by the occasional presence of a spider. While the training set consists only of valid human silhouettes, after training we’ll throw in spiders just for fun. And this is where it gets interesting, because we want to avoid returning height measurements for spiders, but we aren’t allowed any pictures of spiders in training.

Example original silhouettes of children used for training (the two images on the left and right), on either side of a silhouette of a hairy spider that isn’t available for training. The silhouettes are all randomly rescaled to between 1/2x and 2x the original sizes shown.

Of course, spiders are usually much smaller than children. So to prevent simple discrimination based on a ridiculous difference in size, the spiders were upscaled to occupy a space comparable to the children. Furthermore, in the spirit of making the task quite arbitrarily difficult, the children have been randomly rescaled so the silhouettes are anywhere between 1/2x and 2x their original size. Of course, rescaling doesn’t make much sense when we’re interested in the height of the children! But this rather elaborately contrived, senseless problem perfectly demonstrates the power of Bayesian deep learning. You’ll see how well the models generalise to real-world situations with few training examples and without any examples of the corrupted data they’re likely to receive!

Let’s get stuck into the problem with TensorFlow Probability. The full code as well as the data is available in a Jupyter Notebook online at: https://github.com/DoctorLoop/BayesianDeepLearning. First we’ll define the architecture.

Bayesian Convolutional Architecture (https://gist.github.com/DoctorLoop/293ae5cc3bda2ccc333d9b216eacc301)

In the first line we clear any session that might already be in memory, emptying any parameter stores and variables so there’s no interference with a fresh run. Next we define a lambda function that helps us update the loss via the Kullback-Leibler (KL) divergence that we discussed in the previous chapter. We then pass this lambda to each convolutional layer so the loss can be updated with reference to the divergence between an approximate distribution and our prior. Strictly speaking this isn’t absolutely necessary to specify, as the default parameter for the layer is almost the same. The difference is that while the default parameter just takes the KL divergence as-is, we go one step further and divide it by the total number of training examples (836). The default implementation applies the epoch’s total KL to every example, but what we’d prefer is to apply only each example’s proportion of the epoch’s total KL rather than the whole total every time. While both will train, we see better results through scaling the loss this way. Experiment and see for yourself.
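As a minimal sketch of that lambda (the variable names are assumptions rather than the exact code from the gist), it follows the standard TensorFlow Probability pattern of scaling the KL by the dataset size:

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions
NUM_TRAIN_EXAMPLES = 836  # total number of silhouettes in the training set

# Scale each layer's KL divergence by the dataset size so every example
# contributes only its share of the epoch's total KL to the loss.
kl_divergence_function = (lambda q, p, _: tfd.kl_divergence(q, p) /
                          tf.cast(NUM_TRAIN_EXAMPLES, dtype=tf.float32))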

The actual model is defined just as it is for any other Keras Sequential model. Of course we’re using a Convolution2DFlipout layer (we’ll discuss that later) rather than the usual Conv2D. You might be surprised that we’re only using two convolutional layers at a time when it’s near enough a fashion to use hundreds. We’re using two simply because the results are impressive and for this problem we really don’t need more. We’ve also thrown in two maxpool layers between the neurone layers, and both have quite large strides/pool sizes. If you’ve a problem that requires particularly sensitive, pixel-perfect measurement you might want to try removing these. Of course, the cost of doing so will be escalating hardware demands, so it’s recommended to compare both.

The very last layer is a single dense (Bayesian) neurone because we’re interested in just one output. This output will be our measurement. It’s as simple as that.

Finally we compile the model with mean squared error (MSE) loss. This is deceptive, as although we only specify MSE we’re also adding the KL on each layer. However, we defined the KL ourselves, because we’re independent Bayesianists who wanted to give Keras a well-deserved rest. We’ll see proof that the KL is involved when we print the loss during training: it’s noticeably greater than the MSE alone, and the difference between the two is our KL.
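Putting the pieces together, a sketch of the architecture described above might look like the following (continuing from the previous snippet). Only the overall shape follows the text: two Flipout convolutions, two large maxpools, a single Bayesian output neurone and MSE loss. The filter counts, kernel sizes, pool sizes and optimiser are illustrative assumptions, not the exact values from the gist.

tf.keras.backend.clear_session()  # start from a clean slate

model = tf.keras.models.Sequential([
    tf.keras.layers.InputLayer(input_shape=(126, 126, 1)),
    tfp.layers.Convolution2DFlipout(
        8, kernel_size=5, padding='SAME', activation=tf.nn.relu,
        kernel_divergence_fn=kl_divergence_function),
    tf.keras.layers.MaxPooling2D(pool_size=6, strides=6, padding='SAME'),
    tfp.layers.Convolution2DFlipout(
        16, kernel_size=5, padding='SAME', activation=tf.nn.relu,
        kernel_divergence_fn=kl_divergence_function),
    tf.keras.layers.MaxPooling2D(pool_size=6, strides=6, padding='SAME'),
    tf.keras.layers.Flatten(),
    # A single Bayesian neurone: one floating-point height estimate.
    tfp.layers.DenseFlipout(1, kernel_divergence_fn=kl_divergence_function),
])

# Only MSE is specified here; each layer adds its (scaled) KL to the loss.
model.compile(optimizer=tf.keras.optimizers.Adam(), loss='mse', metrics=['mse'])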

Training

Let’s start the training and take a look at that loss:

The training instruction for the Bayesian Convolutional Model (https://gist.github.com/DoctorLoop/4b10c410a709e0dfd71ace8b004255bc)
[Out]:
....
Epoch 250/250
151/151 [==============================] - 1s 4ms/step - loss: 12.5878 - mse: 5.1539 - val_loss: 16.3906 - val_mse: 8.9721
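For reference, the fit call behind output like the above might look something like this sketch. The array names (x_train, y_train, x_val, y_val) are assumptions; the batch size of 5 and 250 epochs follow the discussion below.

# A sketch of the training call, assuming the silhouette arrays have already
# been loaded and split into training and validation sets.
history = model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    batch_size=5,   # small batches train noticeably better here (see below)
    epochs=250,
    verbose=1)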

There are a few things to note here. The loss is relatively high while the batch size is relatively low!

To address the loss first: we’ll repeatedly find that with Bayesian models the loss value is an even worse indicator of model performance than it is for conventional models. The first reason is that we’re combining at least two losses. Of course we’re interested in the change in loss rather than the explicit value, but even then the trend isn’t always clear, as we often vary the relative influence of the two losses progressively over the training epochs. We’ll discuss these considerations in later chapters. Just remember for now that it isn’t unusual to see a classification model with a loss of several thousand(!) while having perfect validation metrics.

While some people may scoff at my puny batch size and assume resources are scarce, they couldn’t be more wrong. With Bayesian models the batch size has a much greater influence on training than we’d expect. It’s one of a number of areas of neural network theory we often think we understand but that demand a review of our beliefs. We usually think of batch size as being of predominant importance to training speed. Some people also appreciate the reduced variance a larger batch brings. However, with Bayesian models batch size directly influences training performance. Have a look and see by running the same model repeatedly with a batch size of 5 and with 50. You’ll notice that when the batch size is 50, epochs are of course much quicker, but we never get loss or performance metrics as good as we do with a batch size of 5. It’s not a small difference; it’s enormous! This is important because we’ll quickly discover batch size is a hyperparameter that’s influential to Bayesian deep learning success.


While at first it seems frustrating that we’ve another hyperparameter to optimise, we’ll find ourselves able to rocket the performance with a very simple change, using architectures that are simpler than those we’ve relied upon in the past (in the appendix at the bottom of this article we’ll discuss the layers, like Flipout, that drive these changes).

Inference

Finally we get to inference. We’re interested in making multiple predictions from our Bayesian master model. Each output will be slightly different because each prediction will be made with a fresh model that’s been filled with weights sampled from the weight distributions of the Bayesian master we trained.

Two list comprehensions each generating 1000 predictions for two different input images. https://gist.github.com/DoctorLoop/09552736976a7e0a32e3f27d28a4ee1c

In the above code we use a list-comprehension-style for loop to make each prediction. Wouldn’t it be quicker if we just provided a single input array (1000 x 126 x 126 x 1) and made all the predictions at once? Indeed it would be much quicker. But at the same time it would defeat the purpose, because it’s the separate model.predict calls that sample fresh weights from the distributions of our Bayesian training. Each predict call is therefore responsible for creating a unique new model that’s constrained by the distributions we created in training. If we made just one predict call with an input of 1000 images, all the predictions would be identical because we’d be working with a single sample of weights, thereby emulating a standard model. We’re more interested in the ability to exploit the infinite bag of models our Bayesian training creates. We call this bag a model ensemble, and we take advantage of the ensemble of multiple different models to get many different perspectives on the same data. The agreement of the many perspectives is what matters most: it tells us about the quality of the data we input.
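As a sketch of the idea (the image variable names and shapes are assumptions), the two prediction loops amount to something like this:

import numpy as np

# Each model.predict call samples a fresh set of weights from the learned
# posterior distributions, giving one ensemble member per call.
baby_predictions = np.array(
    [model.predict(baby_image[np.newaxis, ...]) for _ in range(1000)]).flatten()

spider_predictions = np.array(
    [model.predict(spider_image[np.newaxis, ...]) for _ in range(1000)]).flatten()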

Plotting the predictions from a valid input (baby_predictions) and an invalid input (spider_predictions). https://gist.github.com/DoctorLoop/41abe385934fb0f7728ba048564e26d4
Density plot for 1000 height predictions on a single valid input (green) and 1000 height predictions on a single invalid input (red). The spread of the predictions when the model is given an invalid spider input shows disagreement between the predictions, indicating high uncertainty. The similarity of the measurement predictions for a valid baby silhouette shows agreement between the predictions, indicating a confident prediction.

In the above code and figure we produce a density plot of 1000 height predictions for a single baby image (green) and 1000 for a single spider image (red). We can see that predictions for the baby’s height are very tightly packed together around 51 pixels (the mean and expected value). Around 30% of predictions are at exactly this measurement (coincidentally the true value) and nearly all predictions are within a single pixel of the truth! On the other hand, while predictions for the spider also centre on a value (90 pixels), fewer than 4% of predictions are at that expected value and the predictions are far more widely dispersed (spread out), over a range going from 51 pixels all the way to 134 pixels. Clearly the predictions on the spider don’t agree with each other. We can intuit, therefore, that our Bayesian model is uncertain about predictions on invalid objects while being confident about predictions for the kinds of objects it saw in training. This is exactly how we want it to be.
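A density plot like the one above could be produced with seaborn, for example (a sketch, assuming the prediction arrays from the previous snippet; the gist may plot it differently):

import matplotlib.pyplot as plt
import seaborn as sns

# Kernel density estimate of the 1000 sampled height predictions per image.
sns.kdeplot(baby_predictions, color='green', fill=True, label='valid (baby)')
sns.kdeplot(spider_predictions, color='red', fill=True, label='invalid (spider)')
plt.xlabel('predicted height (pixels)')
plt.ylabel('density')
plt.legend()
plt.show()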

In the next article we’ll explore how we can make simple Bayesian models better than complex standard models. We’ll also find out how other types of uncertainty can be exploited to guide training and how to optimise and compare models to find the very best.

Appendix: TensorFlow-Probability Convolutional Layers

If you’ve read the documentation or any papers recently you may have found different ways to tackle Bayesian deep learning. TensorFlow Probability implements two approaches for convolutional layers (more are available for dense layers). The first is the reparameterization layer:

tfp.layers.Convolution2DReparameterization. Reparameterization lets us calculate gradients via a distribution’s most likely value. We therefore manipulate the parameters that describe the distributions instead of the weight values in the neural network. Dealing with distribution parameters means the actual distribution can be ignored and is effectively abstracted away. The parameters that describe the distribution can be thought of as stand-ins for the distribution object, in the same way that paper money stands in for real assets like gold. In both cases a stand-in is preferred because it’s more convenient. In training we conveniently avoid the embarrassment of attempting backpropagation through random variables (embarrassing because it doesn’t work¹).

Reparameterization is fast, but sadly it suffers from the practical need to share the same sampled weights across all the examples in a batch. If weights were sampled individually for each example instead of shared, the memory requirements would skyrocket. Sharing weights is efficient but increases variance, making training require more epochs.

The flipout layer, tfp.layers.Convolution2DFlipout, takes a different approach. While it’s similar, it benefits from a special estimator for loss gradients. This flipout estimator shakes up the weight perturbations within a mini-batch to make them more independent of each other. In turn the shake-up reduces variance, so training requires fewer epochs than with the reparameterization method. But there’s a catch: while flipout needs fewer epochs, it actually requires twice as many calculations! Luckily these calculations can be parallelised, but we’ll still tend to find a model taking 25–50% longer per epoch (depending on hardware), even though training requires fewer epochs in total.
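Both layers are drop-in replacements for a standard Conv2D and are constructed with the same core arguments; the filter count and kernel size below are purely illustrative.

import tensorflow as tf
import tensorflow_probability as tfp

# Reparameterization variant: one weight sample shared across the whole batch,
# cheaper per step but with higher-variance gradient estimates.
reparam_conv = tfp.layers.Convolution2DReparameterization(
    filters=8, kernel_size=5, padding='SAME', activation=tf.nn.relu)

# Flipout variant: pseudo-independent weight perturbations per example,
# lower-variance gradients at roughly twice the computational cost.
flipout_conv = tfp.layers.Convolution2DFlipout(
    filters=8, kernel_size=5, padding='SAME', activation=tf.nn.relu)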

¹ Without a reparameterized distribution we break the assumption that taking a larger sample gives us a better estimate. While many of us don’t think of training in these terms, we’re depending on that assumption all the time. So with reparameterization we describe the change in the distribution’s most likely value instead of the most likely change in a sample, which we can’t predict (if we could predict it, the variable wouldn’t be random).