Original article was published on Deep Learning on Medium
Understanding Neural Networks
This article focuses on an in-depth understanding of neural network architecture. Later, we will implement it in a Jupyter notebook.
In my previous article, I briefly discussed deep learning and how to get started with it. If you haven’t read that article, please read it here to get an intuitive idea of deep learning and machine learning.
If you have already read it, then let’s get started!
Perceptron?! Some of you may wonder what it is; some of you may know about it already. Simply put, a perceptron is the structural building block of a neural network. Combining many perceptrons into layers yields a deep neural network. A perceptron architecture may look like this:
Here, there are 2 layers in total: an input layer and an output layer. But in the machine learning world, developers don’t count the input as a layer, and hence they will say, “this is a single-layer perceptron model”. So, when someone says, “I have built a 5-layer neural network”, don’t count the input as a layer. What does this perceptron model do? As you can see in the above diagram, we have 2 inputs and a single node marked with a sigma (summation) sign and an activation symbol, and then there is the output. This node computes two mathematical expressions to give the output. First, it takes the weighted sum of the inputs plus a bias, and then the sum is passed through a non-linear activation function, which produces the predicted output. This whole process is called forward propagation in the neural network.
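The two steps above can be sketched in a few lines of Python. This is a minimal sketch using NumPy, with made-up input, weight, and bias values and sigmoid chosen as the activation function:

```python
import numpy as np

def sigmoid(z):
    # Non-linear activation: squashes z into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w, b):
    # Step 1: weighted sum of the inputs plus a bias
    z = np.dot(w, x) + b
    # Step 2: pass the sum through the activation to get the prediction
    return sigmoid(z)

# Hypothetical values for a 2-input perceptron
x = np.array([0.5, -1.0])
w = np.array([0.8, 0.2])
b = 0.1
y_hat = forward(x, w, b)  # the predicted output
```

Here z works out to 0.3, and sigmoid squashes it to roughly 0.574, which is the perceptron’s prediction.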
Have a look at this image:
The inspiration for forward propagation comes from logistic regression. If you know the logistic regression algorithm, this may seem familiar to you; if you don’t, that’s fine. The weights (W) and the biases (b) are the parameters that are “trained” by the neural network, and by “trained” I mean they are adjusted to values at which the loss is minimal.
The output (“y” in the above diagram) is the prediction made by the neural network. The difference between the actual value and the predicted value is called the loss of the neural network. But it is not quite that simple: we do measure the gap between the predicted value and the actual value, but not as a direct difference. Let me explain what I mean.
The Loss and the Cost function
One thing you should know before moving on is that the loss is computed from the predicted value, and the network produces that prediction via “Z”, which depends on “W” and “b”. Ultimately, we can say that the loss depends on “W” and “b”. So, “W” and “b” should be set to values that give minimum loss. To be clear, a neural network is trained to minimize the loss rather than to maximize the accuracy directly.
When solving a deep learning problem, the dataset is huge. For example, let’s say we have to build an image classifier that classifies images of cats and dogs (you can consider it the “Hello World!” of computer vision 🙂 ). To train the neural network, we need as many images of cats and dogs as we can get. In machine learning, each image of a dog or a cat is considered a “training example”, and training a good neural network requires a good number of them. The loss function is the loss calculated for a single training example. What we actually optimize during training is the cost function, which is defined as the average of the losses calculated separately for each training example.
Let us assume there are “m” number of training examples. Then the cost function is:
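Assuming the loss for each of the “m” training examples has already been computed, the cost is just their average. A tiny sketch with made-up loss values:

```python
import numpy as np

# Hypothetical per-example losses for m = 4 training examples
losses = np.array([0.2, 0.5, 0.1, 0.4])

# Cost = sum of the individual losses divided by m
m = len(losses)
cost = np.sum(losses) / m  # equivalently, np.mean(losses)
```

For these four losses the cost comes out to 0.3, the value the optimizer would then try to push down.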
Let’s take the loss of a neural network as:
Loss (say, L) = predicted value (say, yhat) − actual value (say, y)
Since the loss of a neural network depends on “W” and “b”, let’s plot the above loss function with respect to “W” only (for the sake of simplicity; if we took both “W” and “b” into consideration, we would have to plot a 3-D graph, and it would be harder to understand the concept. Also, the bias “b” shifts the activation function to the right or left, like the intercept in the equation of a line).
The plot might look something like this:
The problem with this loss function is that it is a straight line, which makes it impossible to optimize with algorithms like gradient descent (more on that in the next section). For now, understand that the loss function should be a curve with a global minimum.
Researchers later came up with another loss function:
This is called “mean squared error loss”, and it is used in regression problems. For one training example, this seems to be a fair equation: it is a parabola. But for “m” training examples, i.e., for the cost function, it becomes a wavy curve with lots of local minima. We want a bowl-like, convex curve for optimizing the cost.
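A minimal NumPy sketch of mean squared error, with made-up predictions and labels:

```python
import numpy as np

def mse_loss(y_hat, y):
    # Mean squared error: average of the squared differences
    # between predictions and actual values
    return np.mean((y_hat - y) ** 2)

# Hypothetical actual values and predictions
y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.8])
loss = mse_loss(y_hat, y)
```

Squaring keeps every error positive and penalizes large errors more heavily than small ones; for these values the loss is 0.03.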
So, later we got an accepted equation for the loss function which is:
This equation is called “cross-entropy loss” and it is widely used for classification problems in deep learning.
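Here is a small sketch of binary cross-entropy in NumPy. The labels and predictions are made up, and the clipping constant `eps` is my own addition to avoid taking log(0):

```python
import numpy as np

def cross_entropy_loss(y_hat, y, eps=1e-12):
    # Binary cross-entropy, averaged over the examples:
    # -[y*log(y_hat) + (1-y)*log(1-y_hat)]
    y_hat = np.clip(y_hat, eps, 1 - eps)  # keep log() well-defined
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Hypothetical labels and predictions for two training examples
y = np.array([1.0, 0.0])
y_hat = np.array([0.9, 0.1])
loss = cross_entropy_loss(y_hat, y)
```

Confident correct predictions (like these) give a small loss of about 0.105, while a confident wrong prediction would blow the loss up, which is exactly what we want for classification.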
The cost function for the above equation will look something like this:
Now that we know how to compute the cost of a neural network, let’s understand how to optimize the cost function for better performance.
Backpropagation: Training the neural network
Backpropagation is the most important task in building a neural network. It is the process where the actual training of the neural network happens, and it is highly computational; in fact, it accounts for about two-thirds of the whole computation in a neural network. In forward propagation, we saw how to calculate the cost of a neural network. In backpropagation, we use this cost to set the values of “W” and “b” so that the cost is minimized.
In the beginning, we initialize weights (W) and biases (b) to some random small numbers. As the model trains, the weights and biases get updated with new values. This update is done with the help of an optimization algorithm called gradient descent.
In mathematics, a gradient is a slope or derivative, and descent means decreasing. So, in layman’s terms, gradient descent means descending along the slope. Familiarity with calculus will help in understanding gradient descent. Remember when I said that we need a convex cost function for optimization? The reason is that, while descending a convex curve, we cannot get stuck at a local minimum. If the cost function is wavy, there is a chance of getting stuck at one of its local minima, and we would never reach the global optimum value of the cost.
The above representation of the gradient descent algorithm may help you to understand it.
In the gradient descent algorithm, we calculate the derivative of the computed cost with respect to the weights and the biases separately.
Let us consider a simple perceptron model with 2 inputs:
Then the weight is updated as:
The alpha in the above equation is called the learning rate of the neural network. It is a hyperparameter and I will tell you about it later after I tell you about some more hyperparameters. For now, consider it as a constant.
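The update rule above can be sketched like this. The gradient values `dw` and `db` here are made up; in a real network they come out of backpropagation:

```python
import numpy as np

def gradient_descent_step(w, b, dw, db, alpha):
    # Move each parameter against its gradient,
    # scaled by the learning rate alpha
    w = w - alpha * dw
    b = b - alpha * db
    return w, b

# Hypothetical parameters and gradients for a 2-input perceptron
w = np.array([0.8, 0.2])
b = 0.1
dw = np.array([0.05, -0.02])
db = 0.01
w, b = gradient_descent_step(w, b, dw, db, alpha=0.1)
```

Subtracting the gradient moves the parameters downhill on the cost curve; a larger alpha takes bigger steps, a smaller alpha takes smaller, safer ones.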
The weights and biases keep getting updated until the cost function reaches its global optimum value. Thus, we get a predicted output with less error.
One forward propagation and one backpropagation over the entire training set together count as 1 epoch of training. A deep learning practitioner has to set the number of epochs (another hyperparameter) before training the model.
Deep Neural Network
Until now we have worked on a perceptron model, and dealing with a single perceptron is quite easy. But things get ugly as the network gets deeper.
Real-world problems have a huge number of input features, each with its own weight; depending on the problem, there are many hidden layers in the model, and each hidden layer has many nodes, each with its own bias, that compute Z and A. The training process is the same as I described earlier, but the computation is repeated for every node in the neural network.
In the above GIF, you can guess the complexity of a neural network and the high computational requirement for implementing it.
Let’s summarize everything we learned till now:
1. Initialize the weights and biases (done once, before training).
2. During forward propagation:
– Calculate Z and A for each node.
– Calculate the cost of the whole model.
3. During backpropagation:
– Compute the gradients of the cost with respect to the weights and biases separately.
– Update the parameters (W & b) using gradient descent to optimize the cost.
4. Repeat steps 2 and 3 until the cost function reaches its global optimum value.
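Putting all of these steps together, here is a rough end-to-end sketch: a single sigmoid perceptron trained with full-batch gradient descent and cross-entropy loss on a tiny made-up dataset (all values, including the learning rate and epoch count, are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, alpha=0.5, epochs=1000):
    m, n = X.shape
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=n)  # step 1: small random weights
    b = 0.0                             # step 1: zero bias
    for _ in range(epochs):             # each pass = 1 epoch
        # step 2: forward propagation -- compute Z and A
        z = X @ w + b
        a = sigmoid(z)
        # step 3: backpropagation -- gradients of the cross-entropy
        # cost w.r.t. w and b (dz = a - y for sigmoid + cross-entropy)
        dz = a - y
        dw = X.T @ dz / m
        db = np.mean(dz)
        # step 3 (cont.): gradient descent parameter update
        w -= alpha * dw
        b -= alpha * db
    return w, b

# Toy data: label is 1 when the first feature is positive (hypothetical)
X = np.array([[1.0, 0.2], [2.0, -0.5], [-1.0, 0.3], [-2.0, -0.1]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w, b = train(X, y)
preds = (sigmoid(X @ w + b) > 0.5).astype(float)
```

Because this toy data is linearly separable and the cross-entropy cost is convex for a single perceptron, gradient descent steadily drives the weights to values that classify all four examples correctly.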
I hope I was clear throughout the article and that you understood the concepts well. If not, feel free to ask questions in the comments. Also, give your valuable suggestions in the comments so that I can improve my articles. You can also suggest which topics you would like to learn about in the field of deep learning.
Thank you all for giving this article your valuable time.