A Brief Introduction to Deep Learning

Source: Deep Learning on Medium

Deep Learning is a powerful branch of Machine Learning that imitates the workings of the human brain. It is the science behind self-driving cars and the classification of images, text and audio, and it plays a key role in devices like Alexa and Google Home.

Today, Deep Learning has applications in almost every industry, from e-commerce to health care, and it continues to set new records. Deep Learning models can sometimes achieve accuracy that even humans cannot, because they are trained on very large data sets and contain artificial neural networks with many layers.

In this article we’ll cover some of the topics related to Artificial Neural Networks (ANNs).

“An Artificial Neural Network (ANN) is a computational model that replicates the behaviour of the biological nervous system, centred on the brain, to solve a specific problem. The system comprises a large number of interconnected neurons that process information together to solve the problem.”

Topics covered:

  1. Neurons
  2. Activation Function
  3. How does a Neural Network work
  4. How does a Neural Network learn
  5. Gradient Descent
  6. Stochastic Gradient Descent
  7. Training ANN with Stochastic Gradient Descent


A neuron is a nerve cell, the fundamental building block of the nervous system. Neurons are similar to other cells in the human body, with the key difference that they are responsible for transmitting information throughout the body.

Cell body or Soma: Like other cells, the soma contains the nucleus, mitochondria and other cell components. It carries out the activities that keep the neuron alive.

Dendrites: Dendrites are thread-like projections of a neuron. They branch out like a tree and receive incoming signals (information) from other neurons.

Axon: A long tubular structure that carries the information received by the dendrites away from the cell body.

Synapse: Complex structures at the end of the axon that connect with the dendrites of other neurons.

The information received from the dendrites is processed by the soma, and an output is generated. This output is carried via the axon to the synapse, from where it is transmitted to the dendrites of other neurons.

The image below represents a single-layer ANN, also called a Perceptron.

In the above figure, x1, x2, x3 … xm represent the input variables. The input variables must be independent of each other and should be standardised.

w1, w2, w3 … wm are the weights assigned to the synapses. Each weight is multiplied by its respective input value, and then the weighted sum of the input values is calculated.

Mathematically: x1·w1 + x2·w2 + x3·w3 + … + xm·wm = ∑ xi·wi

The activation function 𝜙 is then applied to the weighted sum: 𝜙(∑ xi·wi)
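The computation above can be sketched in a few lines of Python. The input values, weights and the choice of sigmoid activation here are purely illustrative, not part of any real model:

```python
import math

def neuron_output(inputs, weights):
    # Weighted sum: x1*w1 + x2*w2 + ... + xm*wm
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    # Apply an activation function (a sigmoid, for illustration)
    return 1.0 / (1.0 + math.exp(-weighted_sum))

x = [0.5, 0.3, 0.2]   # standardised input values (hypothetical)
w = [0.4, 0.7, 0.2]   # synapse weights (hypothetical)
print(neuron_output(x, w))
```

With these numbers the weighted sum is 0.45, and the sigmoid squashes it to roughly 0.61.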

Activation Function

Activation functions are essential for an Artificial Neural Network to learn and make sense of something really complicated. They add non-linear properties to our Neural Network.

The purpose of an activation function is to convert a node’s input signal into an output signal, which can then be used as input to the next layer.

What happens without activation function?

A Neural Network without an activation function would simply be a Linear Regression model: a linear function, a polynomial of degree one. A linear function is easy to solve, but it is less expressive and has less power to learn from complex data. We want our Neural Network to learn and compute not just a linear function but something more complicated than that.

Without an activation function, our Neural Network would also be unable to learn and model complicated kinds of data such as images, video, audio and speech.
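A tiny demonstration of why this is true: stacking two linear layers without an activation function collapses into a single linear layer. The weights below are toy values chosen only for the demonstration:

```python
# Two stacked 1-D "linear layers" with no activation in between
w1, b1 = 2.0, 1.0
w2, b2 = 3.0, -4.0

def two_linear_layers(x):
    return w2 * (w1 * x + b1) + b2

# The identical mapping as ONE linear layer: w = w2*w1, b = w2*b1 + b2
def one_linear_layer(x):
    return (w2 * w1) * x + (w2 * b1 + b2)

for x in [-1.0, 0.0, 2.5]:
    assert two_linear_layers(x) == one_linear_layer(x)
```

However many linear layers we stack, the network can still only represent a straight line; the non-linearity is what breaks this collapse.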

Why do we need non linearities?

Non-linear functions have degree greater than one, and they show a curvature when plotted. We need a Neural Network model that can learn and represent almost anything, and non-linearity is what makes that possible.

Neural Networks are considered “Universal Function Approximators”, meaning they can learn and approximate any function.

Popular Activation Functions:

  1. Threshold Activation Function
  2. Sigmoid Function
  3. Hyperbolic Tangent Function
  4. Rectifier Function

1. Threshold Activation Function (Binary step function):
A binary step function is a threshold-based activation function. If the input value is above a certain threshold, the neuron is activated and sends exactly the same signal to the next layer; otherwise it is not activated.

Binary Step Function

2. Sigmoid Activation Function (Logistic function) :
A sigmoid function is a mathematical function with a characteristic “S”-shaped curve whose output ranges between 0 and 1, so it is used in models where we need to predict a probability as the output.

3. Hyperbolic Tangent Function(tanh) :
This function is similar to the logistic sigmoid but often performs better. Its range is (-1, 1), and it is also sigmoidal (S-shaped). The tanh function is mainly used for classification between two classes.

4. Rectifier Function (ReLU):

ReLU is the most used activation function in Neural Networks; its range is zero to infinity, [0, ∞). It is simply R(z) = max(0, z), i.e. if z < 0 then R(z) = 0, and if z >= 0 then R(z) = z. As its mathematical form shows, it is very simple and efficient.

Limitations of ReLU:

  1. It should generally be used only within the hidden layers of a Neural Network model.
  2. It can cause a weight update that makes a neuron never activate on any data point again. In other words, ReLU can result in dead neurons.

To fix the problem of dying neurons, the Leaky ReLU function was introduced. Leaky ReLU adds a small slope for negative inputs to keep the updates alive, so its range is -∞ to +∞.
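The activation functions discussed above can be sketched as follows; the threshold and the leaky slope are conventional illustrative defaults, not fixed by any standard:

```python
import math

def step(z, threshold=0.0):
    # Binary step: fires 1 above the threshold, 0 otherwise
    return 1.0 if z >= threshold else 0.0

def sigmoid(z):
    # Logistic function, output in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    # Hyperbolic tangent, output in (-1, 1)
    return math.tanh(z)

def relu(z):
    # Rectifier, output in [0, inf)
    return max(0.0, z)

def leaky_relu(z, slope=0.01):
    # Small slope for negative inputs keeps updates alive
    return z if z > 0 else slope * z
```

For example, sigmoid(0) gives 0.5, relu(-3) gives 0, and leaky_relu(-3) gives a small but non-zero -0.03.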

How does a Neural Network work ?

To better understand this topic, let us take the example of the price of a property. To start with, we have different factors assembled in a single row of data: Area, Bedrooms, Distance to city and Age.

The input values pass through the weighted synapses to the output layer. An activation function is applied to the weighted input values, and the output is generated.

This is a simple Neural Network; its accuracy can be increased by adding hidden layers that sit between the input and output layers.

Now in the above figure, we have added a hidden layer between input and output layers.

The input variables are connected to the neurons via synapses, but not every synapse carries weight: each has either a zero value or a non-zero value.

A non-zero value indicates that the input is important to that neuron;

a zero value indicates that the input is not important and is discarded.

For example, suppose Area and Distance to City are non-zero for the first neuron: they are weighted and matter to that neuron. The other two variables, Bedrooms and Age, are not weighted and so are not considered by the first neuron.

You may wonder why the first neuron considers only two of the four variables. On the property market, it is common for larger homes to become cheaper the farther they are from the city. So what this neuron may be doing is looking specifically for properties that are large but not too far from the city.

This is where the power of neural networks comes from: there are many of these neurons, each doing similar calculations with different combinations of the variables. Each neuron applies its activation function and performs its calculation.

In this way the neurons work and interact very flexibly, allowing the network to look for specific things and make a comprehensive search for whatever it is trained on.

How does a Neural Network learn?

Once the Neural Network model has been built and an output is generated, we take the difference between the actual value and the predicted value. The resulting error value is known as the Cost Function.

Cost Function: one half of the squared difference between the actual and the predicted value.
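That definition is simple enough to write down directly; the numbers in the example are arbitrary:

```python
def cost(y_actual, y_predicted):
    # One half of the squared difference between actual and predicted value
    return 0.5 * (y_actual - y_predicted) ** 2

print(cost(10.0, 8.0))  # 0.5 * (10 - 8)^2 = 2.0
```

The squaring means that over- and under-predictions are penalised equally, and the one-half factor simply makes the derivative tidier during back-propagation.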

For each layer of the network, the cost function is analysed and the weights are adjusted for the next input. Our aim is to minimise the cost function: the lower it is, the closer the predicted value is to the actual value. In this way, the error becomes marginally smaller with each run.

We feed the resulting data back through the entire neural network. As long as a disparity exists between the actual and predicted values, we need to adjust those weights. Once we tweak them a little and run the neural network again, a new cost function is produced, hopefully smaller than the last.

We repeat this process until the cost function is as small as possible.

The procedure described above is known as back-propagation, and it is applied continuously through the network until the error value reaches a minimum.

Gradient Descent

Gradient Descent is an optimisation technique that is used to improve neural network-based models by minimising the cost function.

Gradient descent is an iterative algorithm that starts from a random point on a function and travels down its slope in steps until it reaches the lowest point of that function. This happens in the back-propagation phase, where the goal is to repeatedly recompute the gradient and adjust the weights until the minimum of the cost function is reached.
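The idea can be sketched on a one-parameter cost function; the cost C(w) = (w - 3)², the starting point and the learning rate are all chosen only for illustration:

```python
def gradient(w):
    # Derivative of C(w) = (w - 3)^2, whose minimum is at w = 3
    return 2.0 * (w - 3.0)

w = 10.0                 # arbitrary starting point
learning_rate = 0.1
for _ in range(200):
    # Step downhill: move against the slope, scaled by the learning rate
    w -= learning_rate * gradient(w)

print(round(w, 4))       # converges towards 3.0, the minimum
```

Each step shrinks the distance to the minimum by a constant factor here; in a real network the same update is applied to every weight, using the gradients produced by back-propagation.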

Stochastic Gradient Descent

The word “stochastic” refers to a system or process linked with random probability. Hence, in Stochastic Gradient Descent, a few samples are selected randomly for each iteration instead of the whole data set.

In SGD, we take one row of data at a time, run it through the neural network, then adjust the weights. For the second row, we run it, compare the cost function, and adjust the weights again. And so on.

SGD helps us avoid getting stuck in local minima. It is much faster than batch Gradient Descent because it processes one row at a time and does not have to load the whole data set into memory for each computation.

SGD usually takes a higher number of iterations to reach the minimum because of the randomness of its descent. Even so, each iteration is computationally much cheaper than a full pass of batch Gradient Descent.
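The one-row-at-a-time update can be sketched on a toy 1-D linear model y = w·x. The data here follows y = 2x, so the weight should approach 2; the learning rate and epoch count are illustrative choices:

```python
import random

# Toy data set: four rows following y = 2x
data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]

w = 0.0
learning_rate = 0.05
random.seed(0)
for epoch in range(100):
    random.shuffle(data)           # 'stochastic': visit rows in random order
    for x, y in data:              # one row at a time
        y_hat = w * x
        grad = (y_hat - y) * x     # gradient of 0.5*(y - y_hat)^2 w.r.t. w
        w -= learning_rate * grad  # adjust the weight after this one row

print(round(w, 3))                 # approaches 2.0
```

Note the contrast with batch Gradient Descent, which would average the gradient over all four rows before making a single update per epoch.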

Training ANN with Stochastic Gradient Descent

Step-1 → Randomly initialise the weights to small numbers close to 0 (but not 0).

Step-2 → Input the first observation of your data set in the input layer, one feature per node.

Step-3 → Forward-Propagation: from left to right, the neurons are activated, with the impact of each neuron’s activation determined by the weights. Propagate the activations until you get the predicted value.

Step-4 → Compare the predicted result to the actual result and measure the generated error (the cost function).

Step-5 → Back-Propagation: from right to left, the error is propagated back. Update the weights according to how much each is responsible for the error. The learning rate decides how much the weights are updated.

Step-6 → Repeat steps 2 to 5 and update the weights after each observation (this per-observation update is what makes the training stochastic).

Step-7 → When the whole training set has passed through the ANN, that makes an epoch. Run more epochs.
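The seven steps above can be sketched end to end with a single sigmoid neuron. The OR-gate data set, the learning rate and the epoch count are all hypothetical choices for the demonstration, not part of any prescribed recipe:

```python
import math
import random

random.seed(1)
# Toy data set: the OR truth table, one observation per row
dataset = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]

# Step 1: initialise weights (and a bias) to small numbers close to 0
weights = [random.uniform(-0.1, 0.1) for _ in range(2)]
bias = random.uniform(-0.1, 0.1)
learning_rate = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for epoch in range(1000):                  # Step 7: run many epochs
    for inputs, target in dataset:         # Step 2: one observation at a time
        # Step 3: forward propagation
        z = sum(w * x for w, x in zip(weights, inputs)) + bias
        predicted = sigmoid(z)
        # Step 4: measure the error (derivative of the squared-error cost)
        error = predicted - target
        # Steps 5-6: back-propagate and update after each observation
        delta = error * predicted * (1 - predicted)
        for i, x in enumerate(inputs):
            weights[i] -= learning_rate * delta * x
        bias -= learning_rate * delta

def predict(xs):
    return sigmoid(sum(w * x for w, x in zip(weights, xs)) + bias)
```

After training, the neuron's outputs round to the OR truth table: below 0.5 for (0, 0) and above 0.5 for the other three rows.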

That’s all for this article. In the next article we’ll create an Artificial Neural Network from scratch using Python.

Thanks for reading!!