Source: Deep Learning on Medium
The neural networks we are going to consider here are Artificial Neural Networks (ANNs). As the name suggests, ANNs are computing systems inspired by the structure and functioning of the human brain.
In this article, we will discuss what a neural network is and how it works.
What is a Neural Network?
Before understanding what an ANN is, we first have to understand how neurons communicate with each other in the brain. So let's begin.
The human brain is composed of billions of cells called neurons. Information travels inside neurons as electric signals. Any information that needs to be communicated to another part of the brain is gathered by a neuron's dendrites, processed in the cell body, and passed on to other neurons through the axon. The next neuron can accept or reject the incoming electric signal based on its strength.
So now let's try to understand what an ANN is:
An ANN is a collection of artificial neurons that tries to imitate the functioning of biological neurons in the brain. Each connection can transmit a signal from one node to another, where it can be further processed and passed on to the next connected artificial neuron. ANNs can learn to differentiate between very complex patterns that would be difficult to extract manually and feed to a machine. An ANN consists of three types of layers, and each layer is a stack of artificial neurons. The number of artificial neurons in each layer can vary depending on your design choices.
To get a better understanding of ANNs, let us look at the function of each of these layers.
- The input layer is composed of artificial neurons that take the input data into the system for further processing by the subsequent artificial neurons. The input layer is present at the start of the ANN.
- The hidden layer sits between the input layer and the output layer. It takes in a set of weighted inputs and produces output through an activation function. This layer is called hidden because it is neither the input nor the output layer. This is the layer where all the processing happens. There can be one or more hidden layers depending on the complexity of the problem.
We will discuss weighted inputs and activation functions in more detail later in this article. For now, just keep in mind that a weighted input is fed into the hidden layer, some processing happens inside the artificial neurons of that layer, and the result is used as input for the next hidden layer or the output layer.
- The output layer is the last layer of ANN architecture that produces output for a particular problem. This is the last layer where any processing happens before predicting the outcome of a problem.
What happens inside a neuron?
To get a good grasp of how ANNs work, we need to understand what is actually happening at the neurons. So let's dig in…
This ANN architecture consists of n inputs, a single artificial neuron, and a single output y_j. Here w1, w2, …, wn are the strengths of the input signals corresponding to x1, x2, …, xn respectively.
The above image shows a single artificial neuron. At each neuron, the weighted sum ∑ xiwi is calculated by multiplying each input by its weight and summing the products, and an activation function f is then applied to this sum. An activation function is simply a non-linear function applied to ∑ xiwi to add non-linearity to the architecture. By non-linearity I mean that most of the problems we face in day-to-day life are complex and cannot be solved by linear functions; the activation function adds non-linearity to our architecture so we can get better results on such problems. A neural network without an activation function is just linear regression. The activation function is what lets the network learn more complex tasks and predict their outcomes.
A few of the most used non-linear activation functions are:
- Sigmoid function → f(x) = 1/(1 + exp(-x))
It's used in the output layer of binary classification problems, which produce 0 or 1 as output. Since the value of the sigmoid function lies between 0 and 1, we can simply predict 1 if the value is greater than 0.5 and 0 otherwise. For multiclass classification problems, the Softmax function is the most common choice; it is a generalization of the sigmoid function. You can read more about the Softmax function here.
- Tanh function → f(x) = 2/(1 + exp(-2x)) - 1
It's used in the hidden layers of neural networks. Because its value lies between -1 and 1, the mean of the hidden layer's outputs comes out to be 0 or close to 0, which makes learning easier for the next layer.
- ReLU function → f(x) = max(0, x)
ReLU is much faster to compute than the sigmoid and tanh functions, as it involves only simple mathematics. If you don't know which activation function to choose, simply use ReLU: it is the most general-purpose activation function and is used in most hidden layers.
You can read more about the other activation functions here.
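As a quick sketch, the three activation functions above can be written directly from their formulas (using NumPy here purely for convenience):

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + exp(-x)); output lies in (0, 1)
    return 1 / (1 + np.exp(-x))

def tanh(x):
    # f(x) = 2 / (1 + exp(-2x)) - 1; output lies in (-1, 1)
    return 2 / (1 + np.exp(-2 * x)) - 1

def relu(x):
    # f(x) = max(0, x); zeroes out negative inputs
    return np.maximum(0, x)
```

For example, sigmoid(0) gives 0.5, the decision threshold mentioned above.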
How does a neural network work?
Now that we have an idea of what the basic structure of an ANN looks like and how a few of its components function, we are ready to understand how a neural network actually works.
We will be using (Image of ANN-3) for the example.
- Neurons with 1 written inside them are the biases. A bias unit is an "extra" neuron added to each pre-output layer that always holds the value 1. Bias units aren't connected to any previous layer and in this sense don't represent a true "activity". To get a clear understanding of why we need biases, you can read this post.
- x_1 and x_2 are the inputs.
- W¹_11 represents the weight from x_1 to the first neuron in the hidden layer, W¹_12 represents the weight from x_1 to the second neuron in the hidden layer, and so on. The superscript denotes which layer the weights feed: W¹ denotes weights from the input layer to the 1st hidden layer, and W² denotes weights from the 1st hidden layer to the output layer.
- The arrow pointing to the right is the final output (we will denote it y^) of our ANN architecture.
The working of a neural network can be broken down into three steps:
1) → Initialization
The first step after designing a neural network is initialization:
- Initialize all weights (W¹_11, W²_12, …) with random numbers drawn from a normal distribution, i.e. ~N(0, 1). Keeping the weights close to 0 helps the neural network learn faster.
- Set all the biases to 1.
These are some of the most common initialization practices. There are various other weight initialization techniques; you can read this post for more info.
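As a minimal sketch of this step, assuming the 2-input, 2-hidden-unit, 1-output network used in the example (the seed and layer sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility

# Weights drawn from N(0, 1). The extra row in each matrix holds the
# weights attached to the bias unit, whose output is fixed at 1.
W1 = rng.standard_normal((3, 2))  # [x1, x2, bias] -> 2 hidden units
W2 = rng.standard_normal((3, 1))  # [h1, h2, bias] -> 1 output unit
```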
2) → Feedforward
Feedforward is the process a neural network uses to turn an input into an output. We will reference the image given below when calculating y^.
Read the image from right to left.
- The first operation is multiplying the input vector, including the bias, by the W¹ matrix. Keep in mind that you may have to reshape or transpose matrices to make the multiplication feasible (convert the input to a 1×3 matrix, then multiply by W¹ to get an output of shape 1×2).
- After matrix multiplication, apply the sigmoid activation function.
- So far we have calculated the values for the 1st hidden layer. Now we append a new bias to the hidden layer matrix (it becomes 1×3 instead of 1×2). The next step is multiplying the hidden layer matrix by W².
- After this matrix multiplication, we are left with a single value, which is fed to the sigmoid activation function to produce y^. This type of ANN architecture can be used for binary classification (if the output is greater than 0.5 we predict 1, else 0).
Following these steps, we can turn input into output for an ANN even with 100 hidden layers.
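The steps above can be sketched as follows, assuming sigmoid activations throughout and the 1×3 / 1×2 shapes described in the text (the sample input and random weights are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def feedforward(x, W1, W2):
    # Append the bias unit (fixed value 1) to the input: shape (1, 3)
    a = np.append(x, 1.0).reshape(1, 3)
    h = sigmoid(a @ W1)                  # 1st hidden layer, shape (1, 2)
    h = np.append(h, 1.0).reshape(1, 3)  # append the hidden-layer bias unit
    return sigmoid(h @ W2).item()        # single output value y^

rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 2))  # input -> hidden weights
W2 = rng.standard_normal((3, 1))  # hidden -> output weights
y_hat = feedforward(np.array([0.5, -0.2]), W1, W2)  # a value in (0, 1)
```

Since the final sigmoid maps everything into (0, 1), predicting 1 when y_hat > 0.5 implements the binary classification rule described above.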
Before starting the backpropagation step, we have to understand some concepts that we will need in order to explain it:
→ Gradient Descent
What is a Cost function?
Neural networks have gained so much popularity in machine learning because of their ability to learn and improve with every prediction they make. To be able to improve, we need an error function, or cost function, that tells us how far our prediction is from the actual output (a cost function is a measure of "how good" a neural network did with respect to a given training sample and the expected output). In neural networks there are many cost functions. One of the commonly used ones is the negative log-likelihood loss.
For binary classification, L = -(1/N) ∑_i [y_i log(p_i) + (1 - y_i) log(1 - p_i)]; for multi-classification, L = -(1/N) ∑_i ∑_j y_ij log(p_ij), where j runs over the M classes.
- N here denotes the number of training sample data points.
- p_i denotes the prediction by our model.
- y_i is the actual output.
- M denotes the number of output classes in a multi-classification problem.
The cost function gives the average loss over all training points. By looking at the binary and multi-class log-likelihood loss equations, we can also conclude that they both give the same cost (error) for M = 2. If you are interested, you can read more about other cost functions here.
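The binary form of this loss is straightforward to compute; here is a small sketch (the sample labels and predictions are made up for illustration):

```python
import numpy as np

def binary_nll(y, p):
    # Average negative log-likelihood over the N training points
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Confident, correct predictions cost little...
low_cost = binary_nll([1, 0], [0.9, 0.1])
# ...while confident, wrong predictions cost a lot
high_cost = binary_nll([1, 0], [0.1, 0.9])
```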
What is Gradient Descent?
Gradient descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient (the steepness of a slope or curve, calculated using partial derivatives). In machine learning, we use gradient descent to update the parameters of our model. Parameters include the coefficients of linear regression and the weights and biases of neural networks.
So let's understand how gradient descent helps neural networks learn.
We will use the above image(Graph of cost function(J(w)) vs weights(w)) for explaining gradient descent.
- Suppose that for particular weight and bias values, we get the average cost (error) denoted by the ball in the graph for our training points. We can also observe that the value of the average cost can go as low as J_min(w) by fine-tuning the parameters.
- To update the weights and biases to minimize the cost, gradient descent takes small steps in the direction indicated by the arrows. It does this by calculating the partial derivatives of the cost function with respect to the weights and biases and subtracting them from the respective weights and biases. Since the derivative is positive where the ball is placed, subtracting the derivative of the cost function with respect to w from w decreases w and moves it closer to the minimum.
- We continue this process iteratively until we reach the bottom of the graph, or a point where we can no longer move downhill (a local minimum). This happens for all the weights and biases in the neural network, for each training sample.
The size of these steps is called the learning rate. With a high learning rate, we can take bigger steps, but we risk overshooting the minimum since the slope of the curve is continuously changing. With a very low learning rate, we can confidently move in the direction of the negative gradient since we are recalculating it so frequently. A low learning rate is more precise, but computing the gradient so often is time-consuming, so it will take a very long time to converge.
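To make the update rule concrete, here is a toy sketch minimizing a one-dimensional cost J(w) = (w - 3)² whose gradient we can write by hand (the function and learning rate are assumptions for illustration):

```python
# Minimize J(w) = (w - 3)**2; its gradient is dJ/dw = 2 * (w - 3)
w = 0.0      # initial weight
eta = 0.1    # learning rate (step size)

for _ in range(100):
    grad = 2 * (w - 3)   # slope of the cost curve at the current w
    w -= eta * grad      # step in the direction of the negative gradient

# w is now very close to the minimum at w = 3
```

Try raising eta above 1.0 to see the overshooting behaviour described above.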
3) → Backpropagation
The backpropagation algorithm is an extension of gradient descent: it uses the chain rule to find the error with respect to the weights connecting the input layer to the hidden layer (for a two-layer network). If you are not familiar with the chain rule, please check the following link.
The main idea of backpropagation is to update the weights in a way that reduces the cost of the network, thus generating better predictions. To update the weights of the hidden layers using gradient descent, you need to know how much error each of the hidden units contributed to the final output. Since the output of a layer is determined by the weights between layers, the error produced by the units is scaled by the weights as signals move forward through the network. And since we know the error at the output, we can use the weights to work backward to the hidden layers.
Updating weights using Gradient Descent
We can update the weights with the help of the following equation:
Δw_i = η δ x_i
where η is the learning rate and δ is an error term given by
δ = (y − y^) f′(h)
Remember, in the above equation (y − y^) is the output error, and f′(h) refers to the derivative of the activation function f(h). We'll call that derivative the output gradient. This Δw_i is added to the respective weight, which helps reduce the error of the network, as explained in (Image of Graph of cost function(J(w)) vs weight(w)).
For better understanding, we will consider a network in which the output layer has errors δ^o_k attributed to each output unit k. Then the error attributed to hidden unit j is the output errors, scaled by the weights between the output and hidden layers (and the gradient):
δ^h_j = ∑_k W_jk δ^o_k f′(h_j)
Then the gradient descent step is the same as before, just with the new errors:
Δw_ij = η δ^h_j x_i
where w_ij are the weights between the inputs and the hidden layer and x_i are the input unit values. This form holds for however many layers there are. The weight step is equal to the step size times the output error of the layer times the values of the inputs to that layer:
Δw = η δ_output V_in
Here you get the output error, δ_output, by propagating the errors backward from higher layers, and the input values V_in are the inputs to the layer (the hidden layer activations for the output unit, for example).
Working through an example
The following image depicts a two-layer network. (Note: the input values are shown as nodes at the bottom of the image, while the network’s output value is shown as y^ at the top. The inputs themselves do not count as a layer, which is why this is considered a two-layer network.)
Assume we're trying to fit some binary data and the target is y = 1. We'll start with the forward pass, first calculating the input to the hidden unit, h = ∑_i x_i w_i (for understanding purposes, there is no bias in this network), and the output of the hidden unit, a = f(h). Using this as the input to the output unit, the output of the network is y^ = f(W⋅a).
With the network output, we can start the backward pass to calculate the weight updates for both layers. Using the fact that for the sigmoid function f′(W⋅a) = f(W⋅a)(1 − f(W⋅a)), the error term for the output unit is δ^o = (y − y^) y^ (1 − y^).
Now we need to calculate the error term for the hidden unit with backpropagation. Here we scale the error term from the output unit by the weight W connecting it to the hidden unit. The general hidden-unit error term is δ^h_j = ∑_k W_jk δ^o_k f′(h_j) (Image of Equation-1), but since we have one hidden unit and one output unit, this is much simpler: δ^h = W δ^o a(1 − a).
Now that we have the errors, we can calculate the gradient descent steps. The hidden-to-output weight step is the learning rate, times the output unit error, times the hidden unit activation value: ΔW = η δ^o a.
Then, for the input-to-hidden weights w_i, it's the learning rate times the hidden unit error, times the input values: Δw_i = η δ^h x_i.
After calculating Δwi, add it to the respective wi. This is how the weights are updated to make the network better and better with every iteration.
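Putting the worked example into code gives a sketch like the one below. The specific inputs, initial weights, target, and learning rate are assumptions, since the article's figures carry the actual numbers; the update rules themselves follow the equations above.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Assumed values: 2 inputs, 1 hidden unit, 1 output unit, no biases
x = np.array([0.1, 0.3])    # inputs
w = np.array([0.4, -0.2])   # input -> hidden weights
W = 0.1                     # hidden -> output weight
y = 1.0                     # target
eta = 0.5                   # learning rate

# Forward pass
h = np.dot(x, w)        # input to the hidden unit
a = sigmoid(h)          # hidden unit activation
y_hat = sigmoid(W * a)  # network output

# Backward pass, using f'(z) = f(z) * (1 - f(z)) for the sigmoid
delta_o = (y - y_hat) * y_hat * (1 - y_hat)  # output error term
delta_h = W * delta_o * a * (1 - a)          # hidden error term

# Gradient descent steps
dW = eta * delta_o * a   # hidden -> output weight step
dw = eta * delta_h * x   # input -> hidden weight steps
W += dW
w += dw
```

Repeating the forward and backward passes with the updated W and w is exactly the iteration that makes the network better over time.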
From this example, you can see one of the effects of using the sigmoid function for the activations. The maximum derivative of the sigmoid function is 0.25, so the errors in the output layer get reduced by at least 75%, and errors in the hidden layer are scaled down by at least 93.75%! You can see that if you have a lot of layers, using a sigmoid activation function will quickly reduce the weight steps to tiny values in layers near the input. This is known as the vanishing gradient problem. You can read more about vanishing gradient problem here.