Artificial Neural Networks: Explained

Original article was published on Artificial Intelligence on Medium

Typically when we say Neural Network, we are referring to Artificial Neural Networks (ANN). And though they may sound complex, they are actually quite easy to understand.

We all love shopping and face a dilemma about whether to buy a product or not. To decide, we take many factors into consideration like “Do I really need it?”, “Is it worth the money?”, “How would it look?”, etc. Such factors are all associated with some weightage of their own. For example, we might consider the need to buy the product to have more weightage than the way it looks.

Similarly, a Neural Network has several inputs, and each of them is associated with a weight. A set of such inputs is called the Input Layer. A certain output (decision) is expected from the network, and it is produced by the Output Layer.

Image by author

Here, x’s form the input layer consisting of independent variables, y’s form the output layer consisting of dependent variables, and weights (w’s) are represented by edges.

But while developing an AI that decides whether to buy a product, the weightage may be dependent on the category of items. For example, the color of the product may be a major deciding factor while shopping for clothes, but it may not matter much while shopping for, say, electronics. So the weights must adapt according to the inputs.

To achieve this, we introduce additional nodes (neurons) in between the input and output. By doing this, we give our network a way to adapt, as weights for each input will be different for each neuron. These neurons together form the Hidden Layer. The hidden layer is called so because this is the intermediate layer with which the user does not have direct contact, while the user has contact with input and output layers.

The input layer, the hidden layers, and the output layer together form the Neural Network.

To make the network more adaptive, we can increase the number of neurons in the hidden layer, or even the number of hidden layers. The output of each hidden layer acts as the input to the next layer, which has its own weights.

Image from Deep Learning A-Z by SuperDataScience

How are these weights decided?

Like kernels in a CNN, weights are the actual intelligence factor in an ANN. They can be initialized randomly or by some predefined functions, but they change over time as the network is trained. More on this up ahead.
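To make random initialization concrete, here is a minimal sketch in plain Python. The helper name `init_weights`, the range of -1 to 1, and the layer sizes are all assumptions chosen for illustration, not something prescribed by any particular library.

```python
import random

def init_weights(layer_sizes, seed=0):
    """Return one weight matrix per pair of adjacent layers.

    weights[k][j][i] is the weight on the edge from neuron i in layer k
    to neuron j in layer k + 1.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    weights = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        # One row of n_in weights for each neuron in the next layer.
        layer = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)]
                 for _ in range(n_out)]
        weights.append(layer)
    return weights

# 3 inputs, one hidden layer of 4 neurons, 1 output (hypothetical sizes)
weights = init_weights([3, 4, 1])
print([(len(w), len(w[0])) for w in weights])  # [(4, 3), (1, 4)]
```

Every neuron gets its own weight for every input, which is why the hidden layer here holds 4 rows of 3 weights each.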

What is training?

Just as a child needs to be trained before it can walk, a neural network needs to be trained before it can be used to predict. The way neural networks train is very intuitive and consists of three steps:

  1. Forward propagation.
  2. Calculating loss.
  3. Backpropagation.
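The three steps can be sketched for the smallest possible network: a single neuron with one input and one weight. The toy data, the starting weight, and the `learning_rate` value are assumptions picked only to keep the example short.

```python
# Toy data following y = 2 * x, so the neuron should learn w close to 2.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.5               # arbitrary starting weight
learning_rate = 0.05  # assumed step size for each adjustment

for epoch in range(50):
    for x, y_true in samples:
        y_pred = w * x                    # 1. forward propagation
        loss = (y_pred - y_true) ** 2     # 2. calculating loss
        grad = 2 * (y_pred - y_true) * x  # 3. backpropagation: d(loss)/dw
        w -= learning_rate * grad         #    adjust the weight

print(round(w, 3))  # close to 2.0
```

Each pass repeats the same cycle: predict, measure how wrong the prediction was, and nudge the weight in the direction that reduces the error.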

Forward Propagation

Forward propagation is like telling the neural network to apply its knowledge. How so? Let me explain…

It is just simple multiplication and addition.

As I mentioned earlier, we need to take all the inputs into consideration according to their weights. So we multiply each input value by its corresponding weight, which gives us the amount of effect that input has on the decision. We then sum all of these products to find the resultant of all inputs at that neuron.

Image from Deep Learning A-Z by SuperDataScience

Here, the resultant will be given as (x1*w1) + (x2*w2) + … + (xm*wm).
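The same sum written as a small Python function. The input and weight values below are made up for the shopping example (need, value for money, looks) and are only illustrative.

```python
def weighted_sum(inputs, weights):
    """Compute (x1*w1) + (x2*w2) + ... + (xm*wm) for one neuron."""
    if len(inputs) != len(weights):
        raise ValueError("each input needs exactly one weight")
    return sum(x * w for x, w in zip(inputs, weights))

# Hypothetical scores: need = 1.0, value for money = 0.5, looks = 0.0
x = [1.0, 0.5, 0.0]
# Hypothetical weightage: need matters most
w = [0.5, 0.25, 0.25]

resultant = weighted_sum(x, w)
print(resultant)  # 0.625
```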

But aren’t there many such neurons?

Yes, there are! And there will be just as many resultant values as there are neurons in that layer.
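As a sketch, a layer can be represented as a list of weight lists, one per neuron. The function name `layer_resultants` and the numbers are assumptions for illustration.

```python
def layer_resultants(inputs, layer_weights):
    """Return one weighted sum per neuron in the layer."""
    return [
        sum(x * w for x, w in zip(inputs, neuron_weights))
        for neuron_weights in layer_weights
    ]

inputs = [1.0, 0.5, 0.0]
hidden_weights = [
    [0.5, 0.25, 0.25],  # weights of the first hidden neuron
    [0.0, 1.0, 0.5],    # weights of the second hidden neuron
]

print(layer_resultants(inputs, hidden_weights))  # [0.625, 0.5]
```

Two neurons give two resultants, even though both neurons saw exactly the same inputs.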

So what happens to all these resultants?

Each of them is passed through something called the Activation Function. This is a complex topic, and I’ll be writing a separate article about it. But for now, just remember that the activation function transforms each neuron’s resultant into that neuron’s output, which decides how strongly the neuron passes its signal on to the next layer.
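In common practice, an activation function is applied to every resultant rather than picking one among them. The article does not name a specific function, so the two below (sigmoid and ReLU) are simply widely used examples. ReLU comes closest to "selecting": it silences neurons whose resultant is negative and lets the others through.

```python
import math

def sigmoid(z):
    """Squash any resultant into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Pass positive resultants through unchanged, zero out the rest."""
    return max(0.0, z)

resultants = [-1.5, 0.0, 2.0]  # made-up values for three neurons

print([relu(z) for z in resultants])               # [0.0, 0.0, 2.0]
print([round(sigmoid(z), 3) for z in resultants])  # [0.182, 0.5, 0.881]
```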

This process of multiplying and adding continues until we reach the final layer, i.e. the output layer, where we get our actual prediction.
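Putting the pieces together, a forward pass can be sketched as a loop over layers. This is a simplified illustration: it uses sigmoid as the activation in every layer and leaves out bias terms, since the article does not introduce them. The network weights are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, network):
    """Propagate inputs through every layer and return the final outputs."""
    activations = inputs
    for layer_weights in network:
        # Multiply and add for each neuron, then apply the activation.
        activations = [
            sigmoid(sum(a * w for a, w in zip(activations, neuron_weights)))
            for neuron_weights in layer_weights
        ]
    return activations

network = [
    [[0.5, -0.5], [1.0, 1.0]],  # hidden layer: 2 neurons, 2 inputs each
    [[1.0, -1.0]],              # output layer: 1 neuron, 2 inputs
]

prediction = forward([1.0, 1.0], network)
print(prediction)  # a list with one value between 0 and 1
```

The output of one layer becomes the input of the next, which is all that "continuing until the final layer" means here.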

Calculating losses

Okay, we have some prediction with us, which initially is going to be a really bad one as the weights are randomized. So to improve the predictions, we need to see how far off our predictions are…

Loss is a measure of the difference between the expected value and the predicted value.

The loss can be calculated by various loss functions (also called cost functions) like Mean Squared Error (MSE), Mean Absolute Error (MAE), or Root Mean Squared Error (RMSE), or you may even define your own function to calculate losses.
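Here is how the three named loss functions could be written by hand. The expected and predicted values are invented for the example.

```python
import math

def mse(y_true, y_pred):
    """Mean Squared Error: average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean Absolute Error: average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))

expected = [1.0, 0.0, 1.0, 1.0]
predicted = [0.5, 0.5, 1.0, 0.0]

print(mse(expected, predicted))             # 0.375
print(mae(expected, predicted))             # 0.5
print(round(rmse(expected, predicted), 3))  # 0.612
```

Notice that MSE punishes the one large miss (1.0 versus 0.0) more heavily than MAE does, because the difference is squared.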


Backpropagation

This is the actual learning stage of our network. In backpropagation, we take the losses from the previous step and move along the network from back to front, adjusting the weights as we go, to give better predictions.
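For a feel of what "adjusting the weights" means, consider a sketch for a single neuron, assuming no activation function and a squared-error loss. Here y is the expected value, ŷ is the prediction, and η is a small step size (often called the learning rate); η and the choice of loss are assumptions made to keep the math short.

```latex
\begin{aligned}
\hat{y} &= \sum_{i=1}^{m} x_i w_i \\
L &= (\hat{y} - y)^2 \\
\frac{\partial L}{\partial w_i} &= 2\,(\hat{y} - y)\,x_i \\
w_i &\leftarrow w_i - \eta\,\frac{\partial L}{\partial w_i}
\end{aligned}
```

Each weight is changed in proportion to how much it contributed to the error: a weight attached to a large input gets a larger correction than one attached to an input near zero.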

How does the network know what to change the weights to?

I love this bit… There could be innumerable combinations of weights that the network can use, and it would take years to search for the best among them. To avoid this, we use a method called Gradient Descent.

Gradient Descent is a method to find weights that result in minimum loss.

Imagine thousands of such weights and the losses associated with them. In the simplest case, a plot of the weights against the cost would look like a U.

Image from Deep Learning A-Z by SuperDataScience

The point at the bottommost position has the minimum loss (called the minimum), and we need to find the set of weights corresponding to that loss. On either side of that point, as the weights deviate, the losses increase, hence the U graph.

While training, we have to find this minimum from amongst all the weights, so we start with a random set of weights. Gradient descent works a bit like a pendulum. Imagine you release a ball attached to a string, and it keeps swinging. The point at which it is released is like the random weight we selected. The ball then swings past the lowest point to some point on the other side of it. Similarly, in gradient descent, we do not know the minimum at first, so we calculate the slope at the point where we are right now to get the direction in which the minimum lies, and make a jump in that direction.
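The idea can be sketched on a made-up U-shaped loss with a single weight. The loss function, the starting weight, and `step_size` are all assumptions for illustration; a real network has many weights and a far more complicated loss surface.

```python
def loss(w):
    """A U-shaped loss whose lowest point is at w = 3."""
    return (w - 3.0) ** 2

def slope(w):
    """Slope of the loss at w (the derivative of the function above)."""
    return 2.0 * (w - 3.0)

w = -4.0         # the "random" starting weight
step_size = 0.1  # assumed size of each jump relative to the slope

for _ in range(100):
    # A negative slope means the bottom lies to the right, a positive
    # slope means it lies to the left, so we move against the slope.
    w -= step_size * slope(w)

print(round(w, 4))  # 3.0
```

We never had to try every possible weight. The slope alone told us which way to go at each step.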

The jump may be big and we may move past the minimum as the ball did, but eventually, as we give our network more samples to train upon, we will usually settle close to the minimum.
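To see the pendulum-like overshoot, here is a sketch that records every position while descending a made-up U-shaped loss with its lowest point at w = 3. The deliberately large `step_size` of 0.9 is an assumption chosen to exaggerate the effect.

```python
def slope(w):
    """Slope of the hypothetical loss (w - 3)^2 at the point w."""
    return 2.0 * (w - 3.0)

def descend(w, step_size, steps):
    """Return every weight visited, starting from w."""
    path = [w]
    for _ in range(steps):
        w -= step_size * slope(w)
        path.append(w)
    return path

path = descend(0.0, step_size=0.9, steps=4)
print([round(w, 3) for w in path])  # [0.0, 5.4, 1.08, 4.536, 1.771]
```

The weight jumps from one side of 3 to the other, but each swing is smaller than the last. If the jumps were made much larger still, the swings could grow instead of shrink, which is why the size of the jump has to be chosen with care.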