Neural Networks III: The Big Picture

Source: Deep Learning on Medium

This series of posts on neural networks is part of a collection of notes taken during the Facebook PyTorch Challenge, prior to the Deep Learning Nanodegree Program at Udacity.

Contents

1. Introduction
2. Make the network a function
3. Let’s increase the dimensions
4. Batch Size

1. Introduction

In the first chapter, we saw what a neural network is and what we can expect from it. That approach looked at neural networks from the outside: they are black boxes that, as we have seen, actually perform mathematical operations.

To fully understand neural networks, it’s time to look at them from the inside. In this chapter, we will see how these networks decompose into their smallest building blocks, and how to code one from scratch based on the behaviour we will be demystifying. Indeed, this decomposition is the reason why neural networks are computationally efficient and can be used to solve complex problems that require large datasets.

To start with, let’s take another look at our old friend and try to figure out the smallest pieces we can extract from it.

The only difference is that the dimension of the input and the number of neurons in the hidden layer will be fixed, to make the process easier to follow. Along the way, we will discover where all the magic comes from.

2. Make the network a function

As we saw in the previous part, neural networks are just a representation of the mathematical operations taking place. Therefore, we can represent all the variables as vectors and matrices of different dimensions.

First, we have an input of 2 dimensions. This means that we have 2 different concepts, for example temperature and pressure. The input weights map these 2 input dimensions to 3 hidden dimensions, which is the number of neurons in the hidden layer. Thus, the dimensions of the weight matrix must be (2×3). Also, although it is not drawn, to keep the visualization clear, there is a term ‘b’ that adds the bias after the matrix multiplication. This leads us to this situation:

Now we have all the tools we need to reach the hidden layer. We just need to apply the matrix multiplication:

The result of this will be:

Remember that each input weight connects an input i with a hidden neuron j. Hence the notation Wij: the weight connecting i with j.
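As a rough sketch of this first step, the snippet below computes the net input of the hidden layer with NumPy. The weight and bias values are made up for illustration, and the names `x`, `W1`, `b1` and `z2` are my own choice, not something fixed by the original figures.

```python
import numpy as np

# One observation with 2 features, e.g. temperature and pressure (illustrative values).
x = np.array([[20.0, 1.0]])            # shape (1, 2)

# Hypothetical weights: W1[i, j] connects input i with hidden neuron j.
W1 = np.array([[0.1, 0.2, 0.3],
               [0.4, 0.5, 0.6]])       # shape (2, 3)
b1 = np.array([[0.01, 0.02, 0.03]])    # shape (1, 3), one bias per hidden neuron

# Net input of the hidden layer: matrix multiplication, then add the bias.
z2 = x @ W1 + b1                       # shape (1, 3)

# The same numbers written out neuron by neuron:
# z2[0, j] = x[0, 0] * W1[0, j] + x[0, 1] * W1[1, j] + b1[0, j]
z2_by_hand = np.array([[sum(x[0, i] * W1[i, j] for i in range(2)) + b1[0, j]
                        for j in range(3)]])

print(z2.shape)   # (1, 3)
```

Writing the sum out by hand gives the same three numbers as the matrix product, which is all the (2×3) matrix is doing.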

Once we are in the hidden layer, it’s time to apply the activation function. The result of this will be:

We apply the function element-wise to the net input of each neuron, so we keep our dimension of 3, the number of hidden neurons.
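To make the element-wise idea concrete, here is a small sketch. The text does not say which activation function is used, so the sigmoid below is only an assumption, and the net input values are invented.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied independently to every element of z (assumed activation)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative net inputs of the 3 hidden neurons.
hidden_net_input = np.array([[-1.0, 0.0, 2.0]])    # shape (1, 3)

# Each element is transformed on its own, so the shape does not change.
hidden_activation = sigmoid(hidden_net_input)      # shape (1, 3)

print(hidden_activation.shape)   # (1, 3)
```

Any other element-wise activation would leave the shape untouched in the same way.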

According to the network scheme, the transformation required now is from 3 to 1, as the next layer only has 1 neuron. The last neuron usually does not apply any activation function, as we don’t want a non-linearity at this point. Thus, the matrix that connects these two layers must be (3×1).

Now we have all the tools to apply the weights of our hidden layer. The operation that needs to be performed is then:

As our final neuron does not perform any operation (yet), we can say that z3 is the final output (called Y_hat in previous parts) once the inputs have gone through the whole network. This is what we call the feed-forward operation.
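Putting the pieces together, the whole feed-forward operation for this 2-3-1 network might be sketched as follows. This is a minimal sketch under a few assumptions: the sigmoid activation, the random weights, the zero bias and the function name `forward` are all mine, and no bias is used in the output layer because the text does not mention one.

```python
import numpy as np

def sigmoid(z):
    """Assumed activation function; the text does not name one."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2):
    """Forward pass of a 2-3-1 network with a linear output neuron."""
    z2 = x @ W1 + b1      # (n, 2) @ (2, 3) -> (n, 3): net input of the hidden layer
    a = sigmoid(z2)       # (n, 3): element-wise activation
    z3 = a @ W2           # (n, 3) @ (3, 1) -> (n, 1): final output, no non-linearity
    return z3

# Hypothetical parameters, drawn at random for illustration only.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 3))
b1 = np.zeros((1, 3))
W2 = rng.normal(size=(3, 1))

x = np.array([[20.0, 1.0]])        # Temperature=20, P=1
y_hat = forward(x, W1, b1, W2)     # shape (1, 1)
print(y_hat.shape)                 # (1, 1)
```

Note that nothing in `forward` depends on the number of rows of `x`, which is what makes the next section painless.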

3. Let’s increase the dimensions

We have gone through the whole forward pass for a single observation of 2 dimensions, for example Temperature=20°C, P=1bar. But who can learn anything from only one observation? We need to know more in order to improve our knowledge of what surrounds us and to find patterns in it. If we want to learn how to play chess, we definitely have to try it several times. Does this complicate the process? Of course not! Let’s go through it.

Let’s say that, for example, we have 4 observations (of course you need many more, but I picked the smallest number different from the already used 1, 2 and 3, so that at the end you can see the relationships more easily; we will reach that point, don’t worry).

Thus, our input looks like this:

Now, the variables corresponding to the network itself remain constant. We have not changed anything in the network; we are just using a batch of observations as input instead of a single observation. Therefore, if we perform the forward pass, our results are:

Again, the only difference is that we have one row per observation. The next step is to apply the activation function element-wise, with z2 as the net input of each hidden neuron. Therefore, a3 has exactly the same shape as z2.

And finally, applying Eq. 2 to calculate the output:
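The shapes for a batch of 4 observations can be checked with a short sketch like the one below. The observations and weights are invented, the sigmoid is again an assumed activation, and the variable names are mine.

```python
import numpy as np

def sigmoid(z):
    """Assumed activation function."""
    return 1.0 / (1.0 + np.exp(-z))

# 4 observations of (temperature, pressure); values are illustrative.
X = np.array([[20.0, 1.0],
              [25.0, 1.2],
              [30.0, 0.9],
              [15.0, 1.1]])                  # shape (4, 2)

# The network itself is unchanged: same hypothetical weights as for a single input.
W1 = np.array([[0.1, -0.2, 0.3],
               [0.4, 0.5, -0.6]])            # shape (2, 3)
b1 = np.array([[0.01, 0.02, 0.03]])          # shape (1, 3), broadcast over the 4 rows
W2 = np.array([[0.7], [-0.8], [0.9]])        # shape (3, 1)

z2 = X @ W1 + b1         # shape (4, 3): one row per observation
hidden = sigmoid(z2)     # shape (4, 3): same shape as z2
z3 = hidden @ W2         # shape (4, 1): one output per observation

# Forwarding the first observation alone gives the first row of the batch result.
single = sigmoid(X[0:1] @ W1 + b1) @ W2
print(z2.shape, hidden.shape, z3.shape)   # (4, 3) (4, 3) (4, 1)
```

Only the number of rows changed from 1 to 4; every other dimension comes from the network.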

4. Batch Size

We have been using a batch of input examples. In deep learning, one of the most important breakthroughs was a small modification of a well-known optimizer, stochastic gradient descent (SGD), into mini-batch stochastic gradient descent.

What is the difference?

Well, the vanilla implementation of SGD takes a single input and performs the forward pass. This results in an error that is back-propagated through the network to update the weights.

Mini-batch SGD takes, as seen in Section 3, a batch of inputs and forwards them all together in a single forward pass, averaging their individual errors into a single error for the mini-batch.
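As a small illustration of that averaging, the sketch below turns the errors of 4 examples into one mini-batch error. The squared error is only an assumption, since the text does not say which error function is used, and the predictions and targets are made up.

```python
import numpy as np

# Hypothetical network outputs and targets for a mini-batch of 4 observations.
y_hat = np.array([[0.9], [0.2], [0.4], [0.7]])   # shape (4, 1)
y = np.array([[1.0], [0.0], [0.0], [1.0]])       # shape (4, 1)

# One error per example (squared error is an assumption).
per_example_error = (y_hat - y) ** 2             # shape (4, 1)

# A single error for the whole mini-batch: the average of the 4 errors.
batch_error = per_example_error.mean()

print(per_example_error.ravel())
print(batch_error)
```

Vanilla SGD would instead back-propagate each of the 4 errors separately, one update at a time.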

What is the advantage of using the mini-batch?

Firstly, since we perform only 1 weight update per batch of inputs, the total number of updates is much smaller. Therefore, we reduce the number of computations during training and, consequently, the training time.

Secondly, both the hardware and the software are optimized to perform matrix multiplications. Therefore, it takes less time to process a whole batch in a single matrix multiplication than to process the same observations one after another.

Lastly, it turns out that the optimizer has nicer convergence properties when using mini-batches, since an error averaged over a batch is less noisy than the error of a single example. This helps the training a lot.
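The first two advantages can be made concrete with a short sketch. The dataset size, the batch size of 32 and the helper `updates_per_epoch` are hypothetical, and the second part only checks that the batched and the sequential computations agree; it does not measure time.

```python
import math
import numpy as np

def updates_per_epoch(num_examples, batch_size):
    """Number of weight updates needed to see every example once (hypothetical helper)."""
    return math.ceil(num_examples / batch_size)

# Hypothetical dataset of 1000 examples.
sgd_updates = updates_per_epoch(1000, batch_size=1)          # one update per example
minibatch_updates = updates_per_epoch(1000, batch_size=32)   # one update per mini-batch
print(sgd_updates, minibatch_updates)   # 1000 32

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 2))    # a mini-batch of 32 observations
W1 = rng.normal(size=(2, 3))    # illustrative input-to-hidden weights

# Sequential: one vector-matrix product per observation.
sequential = np.vstack([X[k:k + 1] @ W1 for k in range(X.shape[0])])

# Batched: the same numbers from a single matrix multiplication.
batched = X @ W1
print(np.allclose(sequential, batched))   # True
```

The results are identical; the batched version simply hands all the work to one optimized matrix multiplication instead of 32 small ones.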

This Kaggle kernel is a great source to better understand how the batch size affects the optimizer.