Original article was published by K. Sai Chaitanya on Deep Learning on Medium
ANN: The ‘Crux’ of Deep Learning
An introduction to back propagation and the extent to which it empowers artificial neural networks (ANNs).
In this article, I cover every crucial detail of Artificial Neural Networks from scratch, so that readers get a comprehensive view of the concept in simple terms.
What is a Neuron?
I believe most readers are perplexed when they first come across the word ‘neuron’, but let me assure you that it is quite interesting and not as confusing as you might assume.
A ‘neuron’ in deep learning is analogous to a neuron in biology. Consider a typical biological neuron.
When we touch a hot cup with our index finger, we pull back by reflex, right? We feel something at the tip of the finger and our whole hand recoils involuntarily. So how do we know the cup is hot, and what causes this reflex action?
When we touch the hot cup, the neuron at the tip of our finger is activated and starts sending electrical signals to the brain through a chain of neurons, and the brain finally makes our body react with a reflex.
So now we can consider a neuron as a source of communication to our brain through electrical pulses.
What is a Neuron in Deep Learning?
A neuron in deep learning is called an Artificial Neuron, and correspondingly a network of such neurons is referred to as an Artificial Neural Network. The structure of a neural network in deep learning is given below.
Yes, it is pretty complicated. Understanding the process happening in the background is similarly involved, so let’s break it apart and analyze the structure of a single neuron. Now look:
x1, x2, x3 and so on up to xn are the features that act as inputs to the machine. The node in between sums the products of the weights and the input features and applies an activation function to the result. Finally, the third layer is the output layer, which predicts the outcome.
Layers in a Neural Network:
Input Layer: In biological terms, this layer can be compared to the eyes. Just as we discern the features of a subject by sight (say, when classifying an animal as a dog or a cat, we distinguish them based on attributes or features such as ears, mouth, tail, fur, eyes, teeth and nose), the input layer of the artificial neural network is responsible for taking in the features of a subject. So the input layer must contain all the attributes or features used for classification.
Here, xi represents the individual features.
Hidden Layer: In biological terms, this layer can be compared to the nucleus of a neuron, or a node. Just as the nucleus activates and transfers the electrical signals to the brain, the artificial neuron sums up the products of the weights and input features and then passes the sum through an activation function, which activates that particular neuron. Weights play a major role here; let’s put them aside for now, and we will get back to them with a detailed explanation.
Function of each node in a hidden layer:
- Multiplying each feature by the corresponding weight of that branch.
- Summing up all the products of weights and features.
- Applying an activation function to the resulting sum.
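The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, not code from the original article: the function names, the example numbers and the use of sigmoid as the activation are my own assumptions, and the bias term is the one mentioned later in the section on weights.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(features, weights, bias=0.0, activation=sigmoid):
    """Weighted sum of the inputs followed by an activation function."""
    # Steps 1 and 2: multiply each feature by its weight and add up the products.
    z = sum(x * w for x, w in zip(features, weights)) + bias
    # Step 3: apply the activation function to the sum.
    return activation(z)

# Hypothetical feature and weight values, chosen only for illustration.
features = [1.0, 2.0, 3.0]
weights = [0.5, -0.25, 0.0]
print(neuron_output(features, weights))  # weighted sum is 0.0, so sigmoid gives 0.5
```

Any other activation function can be passed in through the `activation` argument.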
Output Layer: This layer is analogous to our brain in biological terms. Just as we classify subjects in the real world, the output layer is responsible for separating subjects by their ‘unique’ features and giving the output in binary format.
1. Binary Step Function
The first thing that comes to mind for an activation function is a threshold-based classifier, i.e. deciding whether or not the neuron should be activated based on the value from the linear transformation.
In other words, if the input to the activation function is greater than a threshold, the neuron is activated; otherwise it is deactivated, i.e. its output is not considered for the next hidden layer. Let us look at it mathematically:
f(x) = 1, x>=0
= 0, x<0
Moreover, the gradient of the step function is zero, which is a hindrance in the back propagation process. That is, if you calculate the derivative of f(x) with respect to x, it comes out to be 0 everywhere except at x = 0, where the function jumps and the derivative is not defined.
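As a small sketch of the step function and its gradient (the function names are mine, and the gradient at exactly x = 0 is simply treated as 0 here, which is an assumption made for convenience):

```python
def binary_step(x):
    """Return 1 when the input reaches the threshold 0, otherwise 0."""
    return 1 if x >= 0 else 0

def binary_step_gradient(x):
    """The function is flat on both sides of the threshold, so the slope is 0.

    The jump at x = 0 is ignored here (an assumption for illustration).
    """
    return 0.0

for x in (-2.0, -0.5, 0.0, 1.5):
    print(x, binary_step(x), binary_step_gradient(x))
```

Since the gradient is zero for every input, a weight update that is proportional to it never changes the weights.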
2. Linear Function
We saw the problem with the step function: its gradient is zero. This is because there is no component of x in the binary step function. Instead of a binary function, we can use a linear function. We can define the function as:
f(x) = ax
Here the activation is proportional to the input. The variable ‘a’ can be any constant value. When we differentiate the function with respect to x, the result is the coefficient of x, which is a constant.
f'(x) = a
The gradient here does not become zero, but it is a constant that does not depend on the input value x at all. This implies that the weights and biases will be updated during back propagation, but the updating factor will always be the same.
In this scenario, the neural network will not really improve the error, since the gradient is the same for every iteration. In addition, a stack of layers with linear activations is itself just a linear function of the input, so the network cannot capture complex patterns in the data. Hence, a linear function might be suitable only for simple tasks where interpretability is highly desired.
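Here is a rough sketch of why a linear activation adds so little. The helper names and the slopes 2.0 and 3.0 are hypothetical values I picked for illustration. The gradient is the same constant for every input, and two linear activations applied one after the other behave exactly like a single linear activation with slope a1 * a2.

```python
def linear(x, a):
    """Linear activation f(x) = a * x."""
    return a * x

def linear_gradient(x, a):
    """The slope is the constant a, whatever the value of x."""
    return a

def two_linear_layers(x, a1, a2):
    """Apply one linear activation after another."""
    return linear(linear(x, a1), a2)

a1, a2 = 2.0, 3.0  # hypothetical slopes
for x in (-1.0, 0.5, 4.0):
    stacked = two_linear_layers(x, a1, a2)
    single = linear(x, a1 * a2)
    # The stacked and single results match, and the gradient never changes.
    print(x, stacked, single, linear_gradient(x, a1))
```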
3. Sigmoid
The next activation function that we are going to look at is the sigmoid function. It is one of the most widely used non-linear activation functions. Sigmoid maps its input to a value between 0 and 1. Here is the mathematical expression for sigmoid:
f(x) = 1/(1+e^-x)
A noteworthy point here is that, unlike the binary step and linear functions, sigmoid is a non-linear function. This essentially means that when multiple neurons use sigmoid as their activation function, the output is non-linear as well.
Additionally, sigmoid is a smooth S-shaped function and is continuously differentiable. The derivative of this function is:
f'(x) = sigmoid(x)*(1-sigmoid(x))
The gradient values are significant in the range -3 to 3, but the graph gets much flatter in other regions. This implies that inputs greater than 3 or less than -3 will have very small gradients. As the gradient value approaches zero, the network is not really learning.
Additionally, the sigmoid function is not symmetric around zero, so the outputs of all the neurons will be of the same sign. This can be addressed by scaling and shifting the sigmoid function, which is exactly what happens in the tanh function. Let’s read on.
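To see the shrinking gradient in numbers, here is a minimal sketch of sigmoid and its derivative. The function names are my own, and the two-branch form of sigmoid is just a common way to avoid overflow for large negative inputs.

```python
import math

def sigmoid(x):
    """f(x) = 1 / (1 + e^-x), written to stay stable for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_gradient(x):
    """f'(x) = sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-6.0, -3.0, 0.0, 3.0, 6.0):
    # The gradient peaks at 0.25 for x = 0 and shrinks quickly beyond +/-3.
    print(x, round(sigmoid(x), 4), round(sigmoid_gradient(x), 4))
```

Every output is positive, which is the "same sign" issue mentioned above.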
4. Tanh
The tanh function is very similar to the sigmoid function. The only difference is that it is symmetric around the origin. The range of values in this case is from -1 to 1. Thus the inputs to the next layers will not always be of the same sign. The tanh function is defined as:
tanh(x) = 2*sigmoid(2x) - 1 = (e^x - e^-x)/(e^x + e^-x)
As you can see, the range of values is from -1 to 1. Apart from that, all other properties of the tanh function are the same as those of the sigmoid function. Similar to sigmoid, the tanh function is continuous and differentiable at all points.
Let’s have a look at the gradient of the tanh function.
f'(x) = 1 - tanh(x)^2
The gradient of the tanh function is steeper than that of the sigmoid function. You might be wondering how we decide which activation function to choose. Usually tanh is preferred over the sigmoid function, since it is zero-centered and the gradients are not restricted to move in a certain direction.
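A quick way to check both claims (tanh is a rescaled sigmoid, and its gradient is steeper) is the sketch below. It uses Python's built-in `math.tanh`; the helper names are my own.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_gradient(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_gradient(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

for x in (-2.0, 0.0, 0.7, 2.0):
    rescaled = 2.0 * sigmoid(2.0 * x) - 1.0  # tanh written in terms of sigmoid
    # The first two columns agree; near zero, tanh has the larger gradient.
    print(x, round(math.tanh(x), 6), round(rescaled, 6),
          round(tanh_gradient(x), 4), round(sigmoid_gradient(x), 4))
```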
5. ReLU
The ReLU function is another non-linear activation function that has gained popularity in the deep learning domain. ReLU stands for Rectified Linear Unit. The main advantage of using the ReLU function over other activation functions is that it does not activate all the neurons at the same time.
This means that a neuron is deactivated only if the output of the linear transformation is less than 0. Mathematically, the function is defined as:
f(x) = max(0, x)
For negative input values, the result is zero, which means the neuron does not get activated. Since only a certain number of neurons are activated, the ReLU function is far more computationally efficient than the sigmoid and tanh functions.
Let’s look at the gradient of the ReLU function.
f'(x) = 1, x>=0
= 0, x<0
If you look at the negative side of the graph, you will notice that the gradient value is zero. For this reason, during the back propagation process, the weights and biases of some neurons are not updated. This can create dead neurons which never get activated. This is taken care of by the ‘Leaky’ ReLU function.
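Below is a minimal sketch of ReLU and its gradient, following the convention used in this article that the gradient at exactly x = 0 is taken as 1. The function names are my own.

```python
def relu(x):
    """f(x) = max(0, x)."""
    return max(0.0, x)

def relu_gradient(x):
    """1 for x >= 0 and 0 for x < 0 (the convention used in this article)."""
    return 1.0 if x >= 0 else 0.0

for x in (-3.0, -0.1, 0.0, 2.5):
    # Negative inputs give output 0 and gradient 0, so such a neuron
    # receives no weight update for that input.
    print(x, relu(x), relu_gradient(x))
```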
6. Leaky ReLU
The Leaky ReLU function is nothing but an improved version of the ReLU function. As we saw, for the ReLU function the gradient is 0 for x<0, which would deactivate the neurons in that region.
Leaky ReLU is defined to address this problem. Instead of defining the ReLU function as 0 for negative values of x, we define it as an extremely small linear component of x. Here is the mathematical expression:
f(x)= 0.01x, x<0
= x, x>=0
By making this small modification, the gradient of the left side of the graph comes out to be a non-zero value. Hence we would no longer encounter dead neurons in that region. Here is the derivative of the Leaky ReLU function:
f'(x) = 1, x>=0
= 0.01, x<0
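The same idea in code, as a small sketch. The slope 0.01 comes from the definition above; exposing it as a `slope` parameter and the function names are my own choices.

```python
def leaky_relu(x, slope=0.01):
    """f(x) = x for x >= 0 and slope * x for x < 0."""
    return x if x >= 0 else slope * x

def leaky_relu_gradient(x, slope=0.01):
    """1 for x >= 0 and the small constant slope for x < 0."""
    return 1.0 if x >= 0 else slope

for x in (-2.0, -0.5, 0.0, 3.0):
    # The gradient on the negative side is small but never zero.
    print(x, leaky_relu(x), leaky_relu_gradient(x))
```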
Apart from Leaky ReLU, there are a few other variants of ReLU; the two most popular are the Parameterised ReLU function and the Exponential Linear Unit (ELU).
Note: For binary classification, the sigmoid activation function is commonly recommended in the output layer, since its output lies between 0 and 1 and can be thresholded to classify each record as either 0 or 1.
What are weights?
A weight is a parameter within a neural network that transforms input data within the network’s hidden layers. A neural network is a series of nodes, or neurons. Each node has a set of inputs, weights, and a bias value. As an input enters the node, it gets multiplied by a weight value, and the resulting output is either observed or passed to the next layer in the neural network. Often the weights of a neural network are contained within the hidden layers of the network.
“Weights are simply defined as the extent to which the input features affect the output.”
What is a Loss Function?
When the output layer classifies the subject as positive or negative (for instance, 1 for dog and 0 for not a dog), it is inevitable that there will be errors for some observations, as machines are not always perfect in prediction. So our job is to minimize the error function, which is called the loss function, so that the predictions classify the subject as precisely as possible.
The loss function is defined as the sum, over all n records, of the squared differences between the actual y and the predicted y:
Loss = sum of (y actual - y predicted)^2 over all n records
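As a tiny worked example of this loss, here is a sketch with made-up labels and predictions. The numbers and the function name are hypothetical and are not taken from any data set.

```python
def squared_error_loss(y_actual, y_predicted):
    """Sum of squared differences between actual and predicted values."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_actual, y_predicted))

# Hypothetical labels (1 = dog, 0 = not a dog) and model outputs.
y_actual = [1, 0, 1, 1]
y_predicted = [0.9, 0.2, 0.6, 1.0]

# 0.1^2 + 0.2^2 + 0.4^2 + 0^2, which is approximately 0.21
print(squared_error_loss(y_actual, y_predicted))
```

The closer the predictions get to the actual values, the closer this number gets to zero.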
For this to happen (for the loss function to decrease), we have to introduce an indispensable concept underlying ANNs: “Back Propagation”.
During back propagation, the weights are adjusted automatically by traversing backwards from the output layer to the input layer. The main purpose of back propagation is to adjust the weights so that the loss function is reduced.
The weights are adjusted according to the following formula:
*Wx = Wx - a * (dLoss/dWx)
- *Wx in the above formula is the new weight.
- Wx is the old weight.
- a is the learning rate, which decides by how much the weight should be changed at each step in order to reach the global minimum. The learning rate should be neither too small nor too large. If the learning rate is too small, the weights approach the global minimum very slowly and may not reach it in a reasonable number of steps. If the learning rate is too large, the weights oscillate between the side walls of the ‘U’-shaped curve but never reach the global minimum.
- Finally, the derivative represents the slope of the loss curve at that weight. The main intuition behind this is that when the slope is negative, the old weight is increased by a small amount. Conversely, when the slope is positive, the old weight is reduced by a small amount.
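The update rule and the effect of the learning rate can be tried out on a toy problem. The sketch below assumes a made-up ‘U’-shaped loss, Loss(w) = (w - 3)^2, whose minimum is at w = 3. The loss, the starting weight and the three learning rates are my own choices for illustration.

```python
def loss_gradient(w):
    """Slope of the toy loss (w - 3)^2 at the weight w."""
    return 2.0 * (w - 3.0)

def gradient_descent(w, learning_rate, steps):
    """Repeat the update: new weight = old weight - learning rate * slope."""
    for _ in range(steps):
        w = w - learning_rate * loss_gradient(w)
    return w

start = 0.0
print(gradient_descent(start, 0.1, 50))     # ends very close to the minimum at 3
print(gradient_descent(start, 0.0001, 50))  # too small: has barely moved from 0
print(gradient_descent(start, 1.0, 50))     # too large: bounces between 0 and 6
print(gradient_descent(start, 1.0, 51))     # ...and is on the other wall a step later
```

At w = 0 the slope is negative, so the first update increases the weight, which matches the intuition in the last bullet.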
Gradient descent (also called batch gradient descent) requires all the records of the data set to compute the loss function and update the weights so that the loss function decreases.
Stochastic gradient descent and Mini-Batch Gradient descent:
1. SGD uses only one example at a time.
2. In SGD, because it uses only one example at a time, its path to the minimum is noisier (more random) than that of batch gradient descent. But that is fine, since we are indifferent to the path as long as it gives us the minimum and a shorter training time.
3. Mini-batch gradient descent uses n data points (instead of 1 sample in SGD) at each iteration.
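The three flavours differ only in how many records go into each weight update, which the sketch below tries to show. It assumes a made-up one-weight model y = w * x with a squared-error loss and noise-free toy data where the true weight is 2. The names, the data and the learning rate are all hypothetical.

```python
import random

def batch_gradient(w, batch):
    """Mean slope of (w * x - y)^2 with respect to w over the given records."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def train(data, batch_size, learning_rate=0.01, epochs=50, seed=0):
    """Gradient descent where each update uses batch_size records."""
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the records in a new order on every pass
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= learning_rate * batch_gradient(w, batch)
    return w

# Toy data: y = 2 * x for x = 1..8
data = [(float(x), 2.0 * x) for x in range(1, 9)]

print(train(data, batch_size=len(data)))  # batch gradient descent: 1 update per pass
print(train(data, batch_size=1))          # SGD: 8 updates per pass
print(train(data, batch_size=4))          # mini-batch: 2 updates per pass
```

On this easy problem all three end up near w = 2; on real data the single-record updates follow a noisier path.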
Optimization helps the weights reach the global minimum faster.
An epoch is one complete pass over the training data set, that is, one forward and one backward propagation for every record (or batch of records). When we increase the number of epochs, the error is more likely to diminish.
1 epoch = 1 forward propagation + 1 backward propagation over the entire training set
Dropout is a simple and powerful regularization technique for reducing over-fitting in neural networks by preventing complex co-adaptations on training data. It is a very efficient way of performing model averaging with neural networks. The term “dropout” refers to dropping out units (both hidden and visible) in a neural network.
How does the drop out technique work?
Dropout is a technique where randomly selected neurons are ignored during training. They are “dropped out” randomly. This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and any weight updates are not applied to the neuron on the backward pass.
As a neural network learns, neuron weights settle into their context within the network. Weights of neurons are tuned for specific features, providing some specialization. Neighboring neurons come to rely on this specialization, which, if taken too far, can result in a fragile model too specialized to the training data. This reliance on context for a neuron during training is referred to as complex co-adaptation.
You can imagine that if neurons are randomly dropped out of the network during training, other neurons will have to step in and handle the representation required to make predictions for the missing neurons. This is believed to result in multiple independent internal representations being learned by the network.
The effect is that the network becomes less sensitive to the specific weights of neurons. This in turn results in a network that is capable of better generalization and is less likely to over-fit the training data.
P is the dropout factor, which can be determined by hyperparameter tuning.
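To make the idea concrete, here is a small sketch of dropout applied to one layer’s activations. Two things here are my own assumptions and are not stated above: p is treated as the probability of dropping a neuron, and the surviving activations are scaled up by 1 / (1 - p), a commonly used variant known as inverted dropout that keeps the expected total unchanged.

```python
import random

def dropout(activations, p, training=True, rng=random):
    """Zero each activation with probability p during training.

    Assumption: p is the drop probability, and the kept activations are
    scaled by 1 / (1 - p) so that their expected value stays the same.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in the range [0, 1)")
    if not training or p == 0.0:
        return list(activations)  # nothing is dropped at prediction time
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the run is repeatable
layer_output = [1.0] * 10
print(dropout(layer_output, p=0.5, rng=rng))          # a mix of 0.0 and 2.0
print(dropout(layer_output, p=0.5, training=False))   # unchanged
```

A different random set of neurons is dropped on every call, which is what forces the other neurons to step in.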
What is the need for optimization Algorithms?
As discussed earlier in this article, to train a neural network model we must define a loss function that measures the difference between our model predictions and the label that we want to predict. What we are looking for is a set of weights with which the neural network can make accurate predictions, which automatically leads to a lower value of the loss function.
You probably know by now that the mathematical method behind this is called gradient descent.
By repeatedly applying gradient descent to the weights, we will eventually arrive at the optimal weights that minimize the loss function and allow the neural network to make better predictions.
In practice, this technique may encounter certain problems during training that can slow down the learning process or, in the worst case, even prevent the algorithm from finding the optimal weights.
On the one hand, these problems are saddle points and local minima of the loss function, where the loss function becomes flat and the gradient goes to zero.
A gradient near zero does not improve the weight parameters and stalls the entire learning process, because the weight update is proportional to the derivative (the slope of the loss curve), which is zero at such points.
On the other hand, even if the gradients are not close to zero, the gradients calculated for different data samples from the training set may vary in value and direction. We say that the gradients are noisy or have a lot of variance. This leads to a zigzag movement towards the optimal weights and can make learning much slower.
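The stalling effect is easy to reproduce on a toy loss. The sketch below assumes a made-up loss, Loss(w) = w^3, which is flat at w = 0 even though lower loss values exist for negative w. The loss, the starting points and the learning rate are my own choices for illustration.

```python
def loss_gradient(w):
    """Slope of the toy loss w^3 at the weight w; it is zero at w = 0."""
    return 3.0 * w ** 2

def gradient_descent(w, learning_rate, steps):
    """Plain gradient descent: new weight = old weight - learning rate * slope."""
    for _ in range(steps):
        w = w - learning_rate * loss_gradient(w)
    return w

# Starting exactly on the flat point: the update is zero, so nothing ever changes.
print(gradient_descent(0.0, 0.1, 100))

# Starting at w = 1: the steps shrink as the slope flattens, and the weight
# creeps towards 0 without ever getting past it.
print(gradient_descent(1.0, 0.1, 1000))
```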
There are different optimization algorithms. I recommend following the hyperlinks below to get a comprehensive understanding of each optimizer.
“I’ve posted the code for applying Artificial Neural Networks to churn modelling on my GitHub. I recommend downloading the code to your local system and executing it for a better understanding of how each layer functions.”
My GitHub link to that code is given below: