DeepLearning series: Deep Neural Networks tuning and optimization

Now that we have our neural network model set up (check the previous blog), I want to cover three aspects that help you optimize your network:

  • Basics
  • Optimizing the NN
  • Tuning the hyper-parameters

BASICS

How do we know if the network we created is reliable in predicting an outcome?

As we saw at the end of the previous blog, training a neural network is an iterative process, and finding good hyper-parameters helps us reach better results faster. But what’s equally important is what and how you feed your network: your data. It’s like us humans: we can satisfy our hunger and get quickly back to work, but what we eat and how we eat it can affect our performance.

Deep neural networks are very hungry, and perform best with a lot of data!

You’ve learned with previous “traditional” machine learning models that we need to split our data into training, development and test sets.

Why do we need them?

Think about it this way: the development (“dev”) set is used to compare many candidate algorithms and pick the best one; the test set, instead, gives us an unbiased estimate of the performance of the selected network.

In the “traditional” machine learning field, where algorithms could work on relatively small datasets (e.g., around 10,000 samples), the rule of thumb was to use 70% of the data as the training set, 20% as the dev set and the last 10% as the test set.

With Deep Learning, instead, where big data are involved (e.g., more than 1,000,000 samples), the rule is to split the data into 98% training, 1% dev and 1% test.

This is because, as I said, a deep neural network is starving for data, so the more data we use for training, the better the network is. With a million samples, 1% is still 10,000 examples, which is plenty to evaluate the network. Also, it’s sometimes difficult to collect a lot of data, so you want to make the most of what you have.
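As a rough sketch of the 98/1/1 split, assuming the dataset is simply a sequence of examples, we can shuffle once and slice. The helper name `split_dataset` and its parameters are hypothetical, made up for this illustration.

```python
import random

def split_dataset(examples, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle a copy of `examples` and split it into train/dev/test lists.

    Hypothetical helper: the name and the default fractions are assumptions
    that mirror the 98% / 1% / 1% rule of thumb.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)   # one shuffle, so all sets share a distribution
    n = len(shuffled)
    n_dev = round(n * dev_frac)
    n_test = round(n * test_frac)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test

train, dev, test = split_dataset(range(1_000_000))
print(len(train), len(dev), len(test))
```

Shuffling before slicing is what keeps the dev and test sets on the same distribution as each other.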

One note: make sure that the dev and test sets come from the same distribution. Otherwise, we would be choosing an algorithm that works well on our dev set but gives different results in the real world (the test set).

It’s like preparing for a sprint run (your dev set) but then being tested on a marathon (your test set)!

Having correctly set the data sets, we can now measure more efficiently the algorithm’s bias and variance in order to subsequently improve it.

In a previous blog I described the difference between bias and variance in a 2-dimensional space.

Looking at that graph, we can quickly identify the problem, but when dealing with many dimensions there is no such visual representation to help us discern the two causes of error.

Fortunately, when we compare the training, dev and the Bayes errors, we get a clue.

(The Bayes error is the optimal error, that is, the lowest error any model could achieve. In image recognition it can be approximated by the human error, as humans are pretty good at recognizing images. So that’s our benchmark.)

This is our fast rule:

if training_error is much larger than Bayes_error: high bias (underfitting)

if dev_error is much larger than training_error: high variance (overfitting)
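The fast rule can be written as a tiny function. This is only a sketch: the name `diagnose` and the `tolerance` threshold (what counts as a "large" gap) are assumptions, and in practice the threshold depends on the problem.

```python
def diagnose(bayes_error, training_error, dev_error, tolerance=0.02):
    """Return a list of likely problems based on the gaps between errors.

    `tolerance` is an assumed threshold for a "large" gap; it is not part
    of the rule itself and should be chosen per problem.
    """
    problems = []
    if training_error - bayes_error > tolerance:
        problems.append("high bias (underfitting)")
    if dev_error - training_error > tolerance:
        problems.append("high variance (overfitting)")
    return problems

print(diagnose(bayes_error=0.01, training_error=0.08, dev_error=0.09))
print(diagnose(bayes_error=0.01, training_error=0.02, dev_error=0.11))
```

Note that a network can suffer from both problems at once, which is why the function returns a list.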

Great, now let’s see what solutions we have at hand to fix these errors:

To fix high bias (which relates to the training set performance) we can:

  • use a bigger network (more hidden layers or more hidden units)
  • train the network longer (more iterations), or
  • change the neural network architecture

To fix high variance (which reflects the dev set performance) we can:

  • gather more data
  • use regularization, or
  • change the neural network architecture

Most of those fixes are self-explanatory, so I will only spend time on explaining the different methods regarding regularization, which helps us avoid overfitting.

_ _ _ _

Regularization:

There are several regularization methods that we can use in our neural network.

L2 Regularization (or “weight decay”):

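This method adds a regularization term, weighted by the parameter λ, to the cost function:

$$J = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\left(\hat{y}^{(i)}, y^{(i)}\right) + \frac{\lambda}{2m} \sum_{l=1}^{L} \lVert w^{[l]} \rVert^2$$

When we compute the derivative with respect to the weights during backpropagation, the gradient descent update becomes:

$$w^{[l]} = w^{[l]} - \alpha \left( dw^{[l]}_{backprop} + \frac{\lambda}{m} w^{[l]} \right) = \left(1 - \frac{\alpha \lambda}{m}\right) w^{[l]} - \alpha\, dw^{[l]}_{backprop}$$

The weights are multiplied by a factor slightly smaller than 1 at every step, which explains why this regularization is also called “weight decay”.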
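Here is a minimal NumPy sketch of the L2 penalty and the resulting update, assuming the usual (λ/2m)·Σ‖w‖² form of the penalty. The function names `l2_penalty` and `l2_update` are hypothetical, and `dW_backprop` stands for the gradient of the unregularized cost.

```python
import numpy as np

def l2_penalty(weights, lam, m):
    """L2 term added to the cost: (lam / (2 m)) * sum of squared weights."""
    return (lam / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

def l2_update(W, dW_backprop, lam, m, alpha):
    """One gradient descent step on W with the L2 term included.

    dW_backprop is the gradient of the unregularized cost; the extra
    (lam / m) * W term comes from differentiating the penalty.
    """
    dW = dW_backprop + (lam / m) * W
    return W - alpha * dW

W = np.array([[1.0, -2.0], [0.5, 4.0]])
zero_grad = np.zeros_like(W)
# With a zero data gradient, the step only shrinks W by (1 - alpha * lam / m).
W_new = l2_update(W, zero_grad, lam=0.7, m=10, alpha=0.1)
print(W_new / W)
```

Feeding a zero gradient isolates the "decay": every weight is scaled by the same factor, independently of the data.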

Now that we have the formula out of the way let’s see why the L2 Regularization makes the decision boundary “smoother”, less fit to the training data.

If we are using a big value of λ, then, to keep the cost function small, the weights need to get closer to 0. If the weights are almost zero, then a lot of hidden units have very little effect and the network becomes much simpler. Therefore it can’t fit complicated decision boundaries.

Like when we wake up in the morning, the brain is less “active” and can’t handle decisions that are too complex. But after our coffee, our neurons’ network is full and ready to deal with more complex problems.

Let’s look at this from another perspective.

To get a visual indication, let’s consider an activation represented by the tanh function, whose curve looks like this:

(ref: http://mathworld.wolfram.com/HyperbolicTangent.html)

If we have a big value of λ, then the weights are closer to 0, which means that “z” is also small since:

$$z^{[l]} = w^{[l]} a^{[l-1]} + b^{[l]}$$

and “a”, the activation, is also small since a=g(z).

As we chose the activation function to be tanh(z), for small values of z it is basically a linear function (to see what I mean, look at the behavior of the tanh curve for x values around 0).

So if every layer behaves like a linear function, then the whole network is close to linear and it can’t fit complicated decision boundaries.

Just keep in mind, though, that if the value of λ is too big, we risk “oversmoothing” and end up with high bias.

Dropout regularization:

When we add a dropout layer to the neural network, we define a “keep_prob” value: the probability of keeping each node. For every training example, each node is then eliminated with probability 1 − keep_prob.

Let’s look at the following network:

If we set the probability of keeping each node to 50% (keep_prob=0.5) in the two hidden layers, then the network randomly eliminates, on average, 50% of the nodes in each layer (two nodes per layer in this example) for each training example. The eliminated nodes are chosen at random, so they differ from one training example to the next. The network might then look like this:

and we can see that eliminating 50% of the nodes has made the network much simpler.

When we use a dropout function, the network can’t rely on any single feature since at any given time this might be suppressed. So it has to spread out the weights, not putting much weight on any one feature.

It’s like having a lot of friends, none of whom is reliable enough to trust completely (“give a lot of weight”).

Above, I sketched the same probability for each layer, but we can choose to vary it layer by layer. Normally, for layers with a lot of parameters we use a low value of keep_prob (so a high chance of eliminating nodes) to avoid overfitting.

In applications such as computer vision, where we normally don’t have a lot of training data, the use of dropout is very common, as it makes the network less fit to the data.

This technique is used during training, and not during test time, otherwise we would only add noise to the predictions.
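A minimal NumPy sketch of a dropout layer could look like this. It assumes the "inverted dropout" variant, where the kept activations are divided by keep_prob so that their expected value is unchanged; the function name `dropout_forward` and its arguments are hypothetical.

```python
import numpy as np

def dropout_forward(a, keep_prob, rng, training=True):
    """Apply inverted dropout to the activations `a` of one layer.

    Assumption: the "inverted" variant, where kept activations are divided
    by keep_prob so their expected value stays the same and nothing needs
    to change at test time.
    """
    if not training:
        return a                               # no dropout at test time
    mask = rng.random(a.shape) < keep_prob     # True with probability keep_prob
    return a * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((4, 5))                            # 4 hidden units, 5 training examples
print(dropout_forward(a, keep_prob=0.5, rng=rng))
print(dropout_forward(a, keep_prob=0.5, rng=rng, training=False))
```

Each column (training example) gets its own random mask, and with `training=False` the activations pass through untouched.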

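With dropout active, the cost function is no longer well defined, since a different set of nodes is removed at every iteration. A good practice is therefore to first train the network without dropout, so we can compute the cost function at each iteration and check that gradient descent is performing correctly (decreasing the cost monotonically). Then we activate the dropout.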

Data Augmentation

As you know, getting more training data helps to overcome overfitting. Sometimes, however, getting more data might be difficult or expensive, so one technique is to augment the existing training examples.

In the case of images, for example, we can flip the image horizontally (or rotate it) and add this new image to the dataset, or zoom and crop part of the original image.

This is an inexpensive way to gather more data and regularize the algorithm, reducing overfitting. These new data, though, don’t add as much information as brand new independent examples would.
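As an illustration, a horizontal flip and a random crop take only a few lines of NumPy. The sketch assumes images stored as arrays of shape (height, width, channels); the helper name `augment` and the 2-pixel crop margin are arbitrary choices for the example.

```python
import numpy as np

def augment(image, rng, crop=2):
    """Return a horizontally flipped copy and a random crop of `image`.

    `image` is assumed to have shape (height, width, channels); the crop
    margin of 2 pixels is an arbitrary choice for the example.
    """
    flipped = image[:, ::-1, :]                       # mirror left-right
    h, w, _ = image.shape
    top = rng.integers(0, crop + 1)                   # random offset in [0, crop]
    left = rng.integers(0, crop + 1)
    cropped = image[top:top + h - crop, left:left + w - crop, :]
    return flipped, cropped

rng = np.random.default_rng(0)
image = np.arange(8 * 8 * 3).reshape(8, 8, 3)
flipped, cropped = augment(image, rng)
print(flipped.shape, cropped.shape)
```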

Early stopping

As we train with gradient descent, we plot the cost function against the number of iterations, for both the training set and the dev set. When the cost on the dev set reaches a minimum and then starts to increase, it’s a sign that we are overfitting the data and we should stop iterating.
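In code, the stopping rule could be sketched as below. The function name `early_stopping_index` and the `patience` parameter (how many iterations without improvement we tolerate before stopping) are assumptions added so that a single noisy bump in the dev cost does not stop training.

```python
def early_stopping_index(dev_costs, patience=3):
    """Return the index of the iteration at which training would stop.

    `dev_costs` is a list of dev-set costs, one per iteration. `patience`
    (an assumed extra knob) is how many iterations without improvement we
    tolerate before stopping.
    """
    best = float("inf")
    since_best = 0
    for i, cost in enumerate(dev_costs):
        if cost < best:
            best, since_best = cost, 0
        else:
            since_best += 1
            if since_best >= patience:
                return i
    return len(dev_costs) - 1          # never triggered: ran to the end

dev_costs = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
print(early_stopping_index(dev_costs))
```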

Unfortunately, this method is not ideal, since it couples two tasks that orthogonalization tells us to keep separate. Specifically, it tries to fix overfitting, but it also interferes with the optimization of the cost function, because stopping the iterations early goes against the goal of minimizing the cost.

In machine learning, instead, we want to work on one task at a time:

  • optimize the cost function (which we are going to talk about next)
  • reduce overfitting

We always want to turn one “knob” at a time to see how to fix the “machine”, where each knob has a specific effect. Early stopping is instead a knob that affects a couple of things at the same time, and that doesn’t help us figure out what is going on.

__________________

OPTIMIZATION:

There are many techniques we can use to speed up training in a deep neural network.

Normalizing inputs

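  1. First, subtract the mean from each training input. This centers the data. Think of a scatter plot of training data with two features, x1 and x2. The mean is:

$$\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$$

and subtracting it from each input, x = x − μ, re-centers the data around the origin.

  2. Then, normalize the variance, so that each feature has variance equal to 1 (for example, when x1 has a much larger variance than x2). We calculate the variance of the centered data:

$$\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} \left(x^{(i)}\right)^2$$

(where the square is taken element-wise) and divide x by the standard deviation:

$$x = \frac{x}{\sigma}$$

which rescales the data so that both features have a comparable spread.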

The logic, therefore, is that if we don’t normalize the inputs, the cost function will be very elongated, while with normalization it becomes more symmetric.
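The two normalization steps can be sketched in NumPy as follows. The layout of X (one row per feature, one column per example), the helper names `fit_normalizer` and `normalize`, and the small constant `eps` are assumptions for this example.

```python
import numpy as np

def fit_normalizer(X_train, eps=1e-8):
    """Compute the per-feature mean and standard deviation of X_train.

    X_train is assumed to have shape (n_features, m_examples). `eps` is a
    small constant (an assumption) that avoids dividing by zero for
    constant features.
    """
    mu = X_train.mean(axis=1, keepdims=True)
    sigma = X_train.std(axis=1, keepdims=True) + eps
    return mu, sigma

def normalize(X, mu, sigma):
    """Center and scale X with statistics computed on the training set."""
    return (X - mu) / sigma

rng = np.random.default_rng(0)
# x1 has a much larger spread than x2, as in the scatter-plot example.
X_train = np.vstack([rng.normal(5.0, 10.0, 1000), rng.normal(-3.0, 0.5, 1000)])
mu, sigma = fit_normalizer(X_train)
X_norm = normalize(X_train, mu, sigma)
print(X_norm.mean(axis=1), X_norm.var(axis=1))
```

The same `mu` and `sigma` computed on the training set should be reused for the dev and test sets, so that all the data goes through the same transformation.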

You will see what I mean (and I will explain the consequences) in a second, when I plot the cost function.

As you can see above, when I run gradient descent on the cost function for the non-normalized inputs (elongated shape), I need to use a small learning rate, and it might take a lot of small steps to converge.

On the other hand, for a normalized version (symmetric circles), wherever I start, I can take the same steps (even larger steps) to get to the minimum.

Weight Initialization

I mentioned this topic in an earlier blog, but I want to reiterate the concept, as it is one of the optimization methods you can use to train your deep network.

If you have implemented several networks, you have certainly experienced the difficulty of training them. Depending on the weights’ values, the activations can decrease exponentially as a function of the number of layers (if the weights are smaller than 1), or increase exponentially (if they are larger than 1).

This problem is known as “vanishing/exploding gradients”. A partial solution is to initialize the weights properly.

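Logically, since

$$z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$

the larger the number of input features (n), the smaller we want each w_i to be.

Therefore, we can set the variance of w_i to be 1/n.

This means that the weights of layer l can now be initialized as:

$$w^{[l]} = r \cdot \sqrt{\frac{1}{n^{[l-1]}}}, \qquad r \sim \mathcal{N}(0, 1)$$

where n[l-1] is the number of units feeding into layer l, and z will now be on a similar scale from layer to layer.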

Just a quick note: if we are using a “ReLU” activation we set the variance to 2/n, instead of the 1/n used with a “tanh” function.
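A short NumPy sketch of this initialization, with the variance chosen from the activation. The function name `initialize_layer` is hypothetical, and initializing the bias to zero is an assumption (a common choice that is not discussed above).

```python
import numpy as np

def initialize_layer(n_in, n_out, activation, rng):
    """Return (W, b) for one layer with variance-scaled random weights.

    Uses variance 2/n_in for "relu" and 1/n_in for "tanh", where n_in is
    the number of inputs feeding the layer. The zero bias is an assumption.
    """
    scale = {"relu": 2.0, "tanh": 1.0}[activation]
    W = rng.standard_normal((n_out, n_in)) * np.sqrt(scale / n_in)
    b = np.zeros((n_out, 1))
    return W, b

rng = np.random.default_rng(0)
W, b = initialize_layer(n_in=500, n_out=400, activation="relu", rng=rng)
print(W.var(), 2 / 500)
```

The printed empirical variance of W should land close to the target 2/n.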

Gradient checking

This is a debugging method that helps you figure out if backpropagation is working. It numerically checks the derivatives computed by your code to make sure that your implementation is correct.

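We know that the gradient of the cost function is its derivative. Let’s recall the mathematical definition of a derivative:

$$f'(\theta) = \lim_{\varepsilon \to 0} \frac{f(\theta + \varepsilon) - f(\theta - \varepsilon)}{2\varepsilon}$$

This can be numerically approximated, for a small value of ε, as:

$$f'(\theta) \approx \frac{f(\theta + \varepsilon) - f(\theta - \varepsilon)}{2\varepsilon}$$

Therefore, for the cost function J:

$$d\theta_{approx} = \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2\varepsilon}$$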

Great, so now we can take w[1], b[1], …, w[L], b[L] and concatenate them and reshape into a big vector θ.

We do the same for dw[1], db[1], …, dw[L], db[L], which will be stacked in a vector dθ.

Then, for each component “i” of θ, we simply compute dθapprox[i] with the formula above, shifting only θ[i] by ε and keeping all the other components fixed. If everything is correct, this should give a value close to the dθ[i] calculated by backpropagation.

If the algorithm fails gradient checking, then we can look at which component (i.e. dw[1], db[1], …) within the network is failing, since our vector dθ is a vector composed of those elements.
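Here is a minimal sketch of gradient checking on a flat parameter vector θ. It assumes a two-sided difference and summarizes the comparison with a relative difference between the two gradient vectors; the function name `gradient_check`, the value of ε and the toy cost are assumptions for illustration.

```python
import numpy as np

def gradient_check(cost, grad, theta, eps=1e-7):
    """Compare an analytic gradient with a two-sided numerical estimate.

    cost(theta) returns a scalar, grad(theta) returns the analytic gradient.
    Returns the relative difference between the two gradient vectors.
    """
    d_theta = grad(theta)
    d_theta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps                     # shift only component i
        minus[i] -= eps
        d_theta_approx[i] = (cost(plus) - cost(minus)) / (2 * eps)
    num = np.linalg.norm(d_theta_approx - d_theta)
    den = np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta)
    return num / den

# Toy cost J(theta) = sum(theta^2), whose gradient is 2 * theta.
cost = lambda t: np.sum(t ** 2)
theta = np.array([1.0, -2.0, 3.0])
print(gradient_check(cost, lambda t: 2 * t, theta))   # correct gradient
print(gradient_check(cost, lambda t: 3 * t, theta))   # buggy gradient
```

A tiny relative difference suggests the analytic gradient is right, while the deliberately wrong gradient gives a difference that is many orders of magnitude larger.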

Mini-batch gradient descent

The concept here is simple (I promise!). Instead of training the network with the entire training set, we will train it on small batches (small chunks) of it.

This gives us the advantage of seeing gradient descent make some progress even before processing the entire training set and, furthermore, helps to not saturate the memory of our CPU/GPU.

We always want to eat small bites of cake to check if we like it, instead of eating the whole thing and then realizing it’s not quite our taste!

Side note: when we train on the entire training set, our cost function should decrease at every iteration. When training on mini-batches, it might not do so at every iteration, since each mini-batch is composed of different examples. But overall the cost should trend downward from one epoch to the next! (Recall, one epoch is a single pass through the whole training set.)

Finally, watch the size of each mini-batch. At the extremes, if the mini-batch size is equal to the entire training set (m), then each iteration takes a long time. On the other hand, if each mini-batch is composed of just 1 example (stochastic gradient descent), then we lose the speed gained from vectorization.

A good rule of thumb is to train on the entire set if the training dataset is fairly small (fewer than 2,000 examples), and otherwise to choose a mini-batch size of 64, 128, 256 or 512.
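Building the mini-batches for one epoch can be sketched like this. The data layout (one column per example) and the helper name `random_mini_batches` are assumptions.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size, rng):
    """Shuffle the examples and cut them into mini-batches.

    X has shape (n_features, m) and Y has shape (1, m), one column per
    example (an assumed layout). The last batch may be smaller.
    """
    m = X.shape[1]
    order = rng.permutation(m)
    X, Y = X[:, order], Y[:, order]        # same shuffle for inputs and labels
    return [(X[:, k:k + batch_size], Y[:, k:k + batch_size])
            for k in range(0, m, batch_size)]

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 1000))
Y = rng.integers(0, 2, size=(1, 1000))
batches = random_mini_batches(X, Y, batch_size=64, rng=rng)
print(len(batches), batches[0][0].shape, batches[-1][0].shape)
```

With 1,000 examples and a batch size of 64 there are 15 full batches plus a last, smaller one with the remaining 40 examples.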

Gradient descent with momentum

This method computes an exponentially weighted average of the gradients and uses that average to update the weights.

Let’s start with an example. We are trying to minimize a cost function that has an elliptical shape like the one depicted below. Gradient descent is trying to reach the minimum and it oscillates along its path. To prevent the oscillations, that is, to prevent overshooting, we are forced to use a small learning rate. This does reduce the oscillations, but it also slows down the process of reaching the minimum.

Ideally, our goal would be to have slow learning on the vertical axis (small oscillations) and faster learning on the horizontal axis.

Well, I’ve got good news for you. That’s exactly what “gradient descent with momentum” does!

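On every iteration, we compute the moving average of dw and db and use that (instead of the derivatives) to update the weights and the bias. So:

$$V_{dw} = \beta V_{dw} + (1 - \beta)\, dw \qquad V_{db} = \beta V_{db} + (1 - \beta)\, db$$

$$w = w - \alpha V_{dw} \qquad b = b - \alpha V_{db}$$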

We can think of the cost function represented as a bowl and we have a ball that is moving along it to reach the bottom (the minimum).

In the above equations, the derivatives (dw and db) represent the “acceleration” of the ball rolling down, while the terms Vdw and Vdb represent its velocity. The parameter β plays the role of friction, which keeps the ball from speeding up without limit as it gains momentum on its way to the minimum.
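To make the ball analogy concrete, here is a minimal NumPy sketch of the momentum update on a toy elongated bowl. The helper name `momentum_step`, the toy cost, and the values α = 0.05 and β = 0.9 are assumptions made for illustration.

```python
import numpy as np

def momentum_step(w, dw, v_dw, alpha=0.01, beta=0.9):
    """One parameter update with gradient descent with momentum."""
    v_dw = beta * v_dw + (1 - beta) * dw   # moving average of the gradients
    w = w - alpha * v_dw                   # step along the averaged direction
    return w, v_dw

# Toy elongated bowl J(w) = 0.5 * (w1^2 + 25 * w2^2), gradient (w1, 25 * w2).
grad = lambda w: np.array([1.0, 25.0]) * w
w, v = np.array([10.0, 1.0]), np.zeros(2)
for _ in range(500):
    w, v = momentum_step(w, grad(w), v, alpha=0.05)
print(w)
```

The velocity `v` starts at zero and has to be carried from one iteration to the next, just like the weights. The same update applies to b with db.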

RMSprop (root mean square prop):

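This is another algorithm that can speed up gradient descent. As before, it moves quickly along the horizontal axis and damps the oscillations on the vertical one. It does so with the following equations:

$$S_{dw} = \beta S_{dw} + (1 - \beta)\, dw^2 \qquad S_{db} = \beta S_{db} + (1 - \beta)\, db^2$$

$$w = w - \alpha \frac{dw}{\sqrt{S_{dw}} + \varepsilon} \qquad b = b - \alpha \frac{db}{\sqrt{S_{db}} + \varepsilon}$$

where the squares are taken element-wise and ε is a small constant that avoids division by zero.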
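In code, one RMSprop step could look roughly like the sketch below, assuming the standard formulation with a moving average of the squared gradients. The helper name `rmsprop_step`, the toy cost and the values of α, β and ε are assumptions for illustration.

```python
import numpy as np

def rmsprop_step(w, dw, s_dw, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSprop update: scale each gradient component by the root of
    the moving average of its square."""
    s_dw = beta * s_dw + (1 - beta) * dw ** 2
    w = w - alpha * dw / (np.sqrt(s_dw) + eps)
    return w, s_dw

# Toy elongated bowl with gradient (w1, 25 * w2).
grad = lambda w: np.array([1.0, 25.0]) * w
w, s = np.array([5.0, 1.0]), np.zeros(2)
for _ in range(2000):
    w, s = rmsprop_step(w, grad(w), s)
print(w)
```

Because each component is divided by its own running magnitude, the steep direction (factor 25) and the shallow one move at a similar pace. With a constant learning rate the iterates end up hovering near the minimum rather than landing exactly on it.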

Adam algorithm (Adaptive Moment Estimation)

This algorithm combines the effect of “gradient descent with momentum” with RMSprop. This is one of the most effective algorithms, and the one I tend to use most of the time.

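For every iteration t we compute:

$$V_{dw} = \beta_1 V_{dw} + (1 - \beta_1)\, dw \qquad S_{dw} = \beta_2 S_{dw} + (1 - \beta_2)\, dw^2$$

$$V_{dw}^{corrected} = \frac{V_{dw}}{1 - \beta_1^t} \qquad S_{dw}^{corrected} = \frac{S_{dw}}{1 - \beta_2^t}$$

$$w = w - \alpha \frac{V_{dw}^{corrected}}{\sqrt{S_{dw}^{corrected}} + \varepsilon}$$

and likewise for b, using db.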

As we can see from the equations that govern this model, there are several parameters involved, but, fortunately, the inventors of this algorithm recommend default values that work well in most cases.

So normally you have:

  • α: needs to be tuned
  • β1 = 0.9
  • β2 = 0.999
  • ε = 10^-8
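A single Adam update could be sketched as follows, with the default values as keyword arguments (ε = 1e-8 is the value suggested in the original Adam paper). The helper name `adam_step`, the toy cost and the learning rate used in the loop are assumptions for illustration.

```python
import numpy as np

def adam_step(w, dw, v_dw, s_dw, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a parameter array `w` at iteration t (t >= 1)."""
    v_dw = beta1 * v_dw + (1 - beta1) * dw          # momentum-style average
    s_dw = beta2 * s_dw + (1 - beta2) * dw ** 2     # RMSprop-style average
    v_hat = v_dw / (1 - beta1 ** t)                 # bias correction
    s_hat = s_dw / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v_dw, s_dw

# Toy elongated bowl with gradient (w1, 25 * w2).
grad = lambda w: np.array([1.0, 25.0]) * w
w = np.array([5.0, 1.0])
v, s = np.zeros(2), np.zeros(2)
for t in range(1, 3001):
    w, v, s = adam_step(w, grad(w), v, s, t, alpha=0.01)
print(w)
```

The iteration counter t starts at 1, since the bias correction divides by 1 − β^t, which would be zero for t = 0.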

Learning rate decay

The concept here is to slowly reduce the learning rate over time. Why?

Well, let’s think about an implementation of gradient descent with small mini-batches. We know that each step will be a bit noisy, so the algorithm will move towards the minimum but never quite converge, wandering around it instead. The way to fix this is to reduce the learning rate when we get closer to the minimum (remember, reducing the learning rate makes the steps smaller and reduces the oscillations), to allow a tighter oscillation around the minimum.

So that sounds awesome! We can take advantage of a large learning rate at the beginning, which favors faster learning, and then lower the learning rate when we get closer to the minimum.

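So this is how we can set our learning rate α:

$$\alpha = \frac{\alpha_0}{1 + \text{decay\_rate} \times \text{epoch\_num}}$$

where α0 is the initial learning rate and epoch_num is the number of epochs completed so far (recall that one epoch is one pass through the entire training set).

The “decay_rate” is another hyper-parameter that we need to tune.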
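As a small sketch, assuming the common inverse-time form α = α0 / (1 + decay_rate × epoch_num), the schedule is a one-liner. The function name `decayed_learning_rate` and the example values α0 = 0.2 and decay_rate = 1.0 are assumptions.

```python
def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    """Learning rate for a given epoch: alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1 + decay_rate * epoch_num)

for epoch_num in range(4):
    print(epoch_num, decayed_learning_rate(alpha0=0.2, decay_rate=1.0, epoch_num=epoch_num))
```

The learning rate starts at α0 and shrinks with every epoch, quickly at first and then more and more slowly.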

_ _ _ _ _ _ _ _

HYPER-PARAMETERS TUNING

As you have seen throughout this blog, each optimization algorithm has its hyper-parameter(s) that need to be tuned to gain the best performance.

We saw:

The learning rate α, β (when dealing with gradient descent with momentum), β1, β2 and ε (for the Adam algorithm) and then the # of layers, the # of hidden units, the mini-batch size, the learning rate decay.

Yeah, there’s a lot of them. You might be wondering … where should I start?

Andrew Ng gives us his suggestions to answer that question.

The most important parameter, according to Ng, is certainly the learning rate α. This drives the whole process of learning (duh!): if the value is too large we can overshoot, and if it is too small we get to convergence very slooowly.

Then, we can focus on β, mini-batch size and # of hidden units.

Finally, we can get our hands on the # of layers and the learning rate decay.

Luckily, Adam’s parameters usually work well at their default values: β1 = 0.9, β2 = 0.999.

_________________

Before closing this blog (I know, I know, I’ll be quick) I want to mention an algorithm, Batch Normalization, that is really helpful in making the network pretty robust and in making the hyper-parameters search problem easier since it “allows” a bigger range of hyper-parameters that work well.

BATCH NORMALIZATION

Earlier I talked about normalizing the input features using their mean and variance, which helps speed up learning. Well, the batch normalization algorithm does the same for the hidden layers, normalizing the values z before the activation is applied.

It also makes the weights later (or deeper) in the network more robust to changes to weights in earlier layers. Therefore, it makes each layer more “independent” and ready to learn by itself, speeding up the whole learning process.

Finally, it also has a regularization effect. In fact, when using mini-batch gradient descent and batch normalization, each mini-batch is scaled by the mean/variance computed on just that mini-batch. This adds some noise to the values z[l] within that mini-batch. So, similar to dropout, it adds some noise to each hidden layer’s activations, and the following layers can’t “rely” too much on any one of them.

This slight regularization effect, though, is reduced if we increase the size of the mini-batch.
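The forward pass of batch normalization for one layer can be sketched in NumPy as below. The function name `batch_norm_forward` is hypothetical. The learnable scale `gamma` and shift `beta` are part of the standard formulation but are not discussed above, so treat them as an assumption here (and note that this `beta` has nothing to do with the β of momentum).

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-8):
    """Normalize the pre-activations z of one layer over a mini-batch.

    z has shape (n_units, batch_size). gamma and beta, both of shape
    (n_units, 1), are the learnable scale and shift of the standard
    formulation; they are an assumption, not taken from the text.
    """
    mu = z.mean(axis=1, keepdims=True)        # statistics of this mini-batch only
    var = z.var(axis=1, keepdims=True)
    z_norm = (z - mu) / np.sqrt(var + eps)
    return gamma * z_norm + beta

rng = np.random.default_rng(0)
z = rng.normal(3.0, 5.0, size=(4, 64))
z_tilde = batch_norm_forward(z, gamma=np.ones((4, 1)), beta=np.zeros((4, 1)))
print(z_tilde.mean(axis=1), z_tilde.var(axis=1))
```

Since `mu` and `var` come from the current mini-batch only, a different mini-batch gives slightly different statistics, which is where the noise mentioned above comes from.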

This blog has been based on Andrew Ng’s lectures at DeepLearning.ai

Source: Deep Learning on Medium