Neurons, Activation Functions, Back-Propagation, Epoch, Gradient Descent: What are these?

Deep Learning: Why Now

Deep Learning and Neural Networks have been around since the '80s, but no big leaps were made until recently due to the lack of strong processing power and data. Back in the '80s and '90s, storage memory was limited and expensive, which meant we couldn't load much data to process. Also, CPUs and GPUs were very expensive and unaffordable. Nowadays you can spin up a high-end NVIDIA GPU and a few gigabytes of memory on AWS, GCP, or Azure for a few dollars.

Until Recently Processing Power & Memory weren’t sufficient for Deep Learning. Source.

To give you an idea of how expensive CPUs and memory were, just look at these ads from the mid-'90s. 16MB of memory with a 120MHz CPU for over $2,000 was top of the line back then. Now look up GPUs and memory on Amazon or at your local Best Buy and see how much storage and processing power is available at very affordable prices.

Deep Learning: What is it

Deep Learning is based on Neural Networks, which are meant to mimic the human brain, the most powerful tool on the planet.

Deep vs. Shallow Neural Networks. Source.

These Neural Networks are made of many Neurons arranged in an input layer, an output layer, and hidden layers. Each input, output, or hidden node is a Neuron with an Activation Function, and each serves a purpose in how the Neural Network learns. We will get into how they learn later in this article.

The more hidden layers, the deeper the Neural Network. There are different opinions on what makes a Neural Network shallow vs. deep, but the rule of thumb is that more hidden layers make the network deeper, while just a few hidden layers (some people say 2 or 3) make it shallow.

After learning, these Neurons will have different weights; the weights are adjusted during training, and this adjustment of the Neurons' weights is the result of learning. The weights get adjusted using Back-Propagation and Gradient Descent, which are covered later in this article.

Deep Learning vs. Machine Learning

Deep Learning is a subset of Machine Learning: every Deep Learning algorithm is considered Machine Learning, but not every Machine Learning algorithm is considered Deep Learning.

AI vs. ML vs. DL. Source.

The biggest difference between Machine Learning and Deep Learning is how they learn.

Machine Learning: You select the model (a certain Classifier for example) to train and manually perform feature extraction for the model to learn.

Deep Learning: You select the architecture of the Neural Network (number of layers, activation functions), and the features are automatically extracted from the labeled training data you feed it.

To understand this better, let's take the example of classifying cars and buses. In Machine Learning, you define which classifier you want to use and then perform feature extraction: for example, you teach the model that a car's features are certain dimensions, 2 or 4 doors, 4 tires, 4 windows, etc. For buses, you teach the model that a bus's features are certain dimensions, 1 or 2 doors, 8 tires, 10 windows, etc. Here you manually extracted the features and fed them into your model.
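
To make this concrete, here is a minimal sketch of the Machine Learning approach using scikit-learn. The feature values (doors, tires, windows, length) are made up purely for illustration.

```python
# Machine Learning approach: we hand-craft the features and pick the classifier
# ourselves, then feed the model our pre-extracted feature rows.
from sklearn.tree import DecisionTreeClassifier

# Each row is [doors, tires, windows, length_in_meters] -- values are made up.
X = [
    [4, 4, 4, 4.5],    # car
    [2, 4, 4, 4.2],    # car
    [2, 8, 10, 12.0],  # bus
    [1, 8, 12, 13.5],  # bus
]
y = ["car", "car", "bus", "bus"]

model = DecisionTreeClassifier()   # we choose the model ourselves
model.fit(X, y)                    # it learns from our hand-made features

print(model.predict([[4, 4, 4, 4.3]]))  # -> ['car']
```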

Machine Learning vs. Deep Learning

In Deep Learning, you don’t need to do feature extraction or even know the features. You start by selecting your network for example Convolution Neural Networks (CNN) since we will work with images. Define the number of layers, and the activation function to be used then point your network to a folder of labeled images of cars and buses. The Neural Network on its own will go through the pictures and capture features automatically for cars and buses. That’s it, your Neural Network taught itself how to differentiate between a car and a bus.

Activation Function

An Activation Function is what happens inside a Neuron: each Neuron has an activation function that runs when the Neuron is fired up.

Activation Function in Neural Networks. Source.

The Neuron's input passes through the activation function, gets processed, and is then sent to the next layer or output Neuron. It's the means by which the neural network learns, and in the end it decides what gets fired to the next neuron.

There are many different activation functions but I will cover the main five here.

Threshold (Binary Step) Activation Function. Source.

Threshold Function

Also known as the binary step function, it is a threshold-based activation function: if the input value is above a certain threshold, the Neuron is activated and sends a signal to the next layer; if it is below the threshold, it sends nothing. It's essentially a yes-or-no function.

Sigmoid (Logistic) Activation Function. Source.

Sigmoid Function

Mainly used for logistic regression, it’s smoother than the threshold function. It is also very useful at the output layer of the Neural Network. Some data scientists complain that it is computationally expensive.

ReLU Activation Function. Source.

Rectifier (ReLU) Function

It is one of the most popular activation functions for Neural Networks. It can help mitigate vanishing and exploding gradient problems. ReLU is a non-linear function and very computationally efficient.

Hyperbolic Tangent (Tanh) Function

It's similar to Sigmoid but, unlike Sigmoid, it goes below zero. It's mainly used when the input has strongly negative, neutral, or strongly positive values.

Softmax Activation Function. Source.

Softmax Function

This is mainly used in classification problems for multi-class predictions. It typically sits in the output layer of the Neural Network.
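
To make the five functions concrete, here is a small NumPy sketch of each one applied to the same set of arbitrary input values.

```python
# The five activation functions above, written out in NumPy so you can see
# exactly what each one does to a neuron's input.
import numpy as np

def threshold(x):            # binary step: fires (1) or doesn't (0)
    return np.where(x >= 0, 1.0, 0.0)

def sigmoid(x):              # squashes the input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):                 # passes positives through, zeroes out negatives
    return np.maximum(0.0, x)

def tanh(x):                 # like sigmoid but ranges from -1 to 1
    return np.tanh(x)

def softmax(x):              # turns a vector of scores into class probabilities
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(threshold(x))  # [0. 0. 1. 1. 1.]
print(sigmoid(x))    # values between 0 and 1
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(tanh(x))       # values between -1 and 1
print(softmax(x))    # five probabilities that sum to 1
```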

Different Activation Functions in the same Neural Network. Source.

You can apply different activation functions to the hidden layers and the output layer. In this diagram, ReLU is applied in the hidden layers while Sigmoid is applied at the output layer. This is a common setup for predicting the probability of something.
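
As a quick sketch of that setup, here is a tiny Keras model with ReLU in the hidden layers and Sigmoid at the output so the network produces a single probability. The layer sizes and input shape are arbitrary choices for illustration.

```python
# ReLU in the hidden layers, Sigmoid at the output: the output is a probability.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),  # hidden layer
    tf.keras.layers.Dense(16, activation="relu"),                     # hidden layer
    tf.keras.layers.Dense(1, activation="sigmoid"),                   # probability output
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```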

Neural Networks: How Do They Learn

Learning happens by passing labeled data all the way from the input layer to the output layer and then all the way back. Since the data is labeled, the Neural Network knows the expected output and compares it to its actual output.

In the first Epoch, the labeled data is entered at the input layer and propagated to the output layer, where your Neural Network calculates an output. The difference between the actual output of your Neural Network and the expected output is measured by the Cost Function. The goal of your Neural Network is to decrease this Cost Function as much as possible. So, your Neural Network will back-propagate from the output layer all the way to the input layer and update the weights of the Neurons accordingly, in an attempt to minimize the Cost Function.

This act of sending the training data from the input layer to the output layer and then all the way back, once over the whole dataset, is called an Epoch. In each Epoch, the Neural Network updates the weights of the Neurons, which is also known as learning. After multiple Epochs and weight updates, the loss function (the difference between the Neural Network's output and the expected output) should reach a minimum.
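
Here is a toy NumPy illustration of that loop on a single neuron with made-up data: forward pass, cost, back-propagation of the error, and a weight update, repeated over many epochs.

```python
# A toy version of the learning loop: forward pass, cost, back-propagation,
# weight update -- repeated epoch after epoch on made-up data.
import numpy as np

X = np.array([[0.0], [1.0], [2.0], [3.0]])   # inputs
y = np.array([[0.0], [2.0], [4.0], [6.0]])   # expected outputs (y = 2x)

w = np.random.randn(1, 1) * 0.01             # small random initial weight
learning_rate = 0.05

for epoch in range(100):
    output = X @ w                            # forward pass: actual output
    cost = np.mean((output - y) ** 2)         # cost: actual vs. expected
    grad = 2 * X.T @ (output - y) / len(X)    # back-propagate the error
    w -= learning_rate * grad                 # update the weight

print(w)     # close to [[2.]] -- the network has "learned" the mapping
print(cost)  # close to 0
```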

Neural Networks During Learning Phase

After learning, the Neurons will have different weights, and those weights will dictate future outputs. For example, in our earlier car vs. bus classification scenario, a Neuron could be looking at the number of windows to decide if the object is a car or a bus; obviously, this Neuron will carry a higher weight than a Neuron looking at the color of the object. This is an oversimplification of how Neurons work, but it should give you the idea of Neuron weights reflecting importance.

Gradient Descent

Gradient Descent. Source.

Gradient Descent is a method for minimizing the Cost Function in order to update the Neurons' weights. It tells your Neural Network in which direction, and by how much, to adjust the weights in a fast, efficient manner so that the difference between the actual and expected outputs keeps shrinking.

The easiest and most common way to picture it is to think of your Cost Function as a hill and the current weights as a ball rolling down it: the ball keeps following the slope until it settles at the lowest point.
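
In code, the "ball rolling downhill" idea looks like this. The cost function (w - 3)² is just a stand-in chosen for illustration, with its minimum at w = 3.

```python
# Gradient Descent on a simple convex cost: repeatedly step in the direction
# of the negative slope until the "ball" reaches the bottom of the hill.
def cost(w):
    return (w - 3) ** 2

def slope(w):           # derivative of the cost with respect to w
    return 2 * (w - 3)

w = 10.0                # start the "ball" high up the hill
learning_rate = 0.1
for step in range(50):
    w -= learning_rate * slope(w)   # roll a little further downhill

print(w)        # ~3.0, the lowest point
print(cost(w))  # ~0.0
```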

Stochastic Gradient Descent

Stochastic Gradient Descent builds on top of Gradient Descent and can cope with more complicated Cost Functions.

Gradient Descent Stuck in Local Minima and Misses True Minima. Source.

Gradient Descent works well only for convex cost functions with a single minimum. With complicated Cost Functions, Gradient Descent can easily get stuck in a local minimum, which ruins your Neural Network's learning.

Stochastic vs. Batch Gradient Descent

To understand how Stochastic Gradient Descent differs from Gradient Descent, let's take an example. Assume your labeled data is organized as rows and you're feeding them into your Neural Network for training.

Gradient (Batch) Descent is when your Neural Network goes through the data one row at a time and calculates the actual output for each row. Then, after finishing all the rows in your dataset, the Neural Network accumulates the error of all rows against the expected outputs and back-propagates to update the weights. This means the Neural Network updates the weights once after working through the entire dataset as one big batch; that is one big, time-consuming Epoch. The Neural Network will do this several times to train the network.

Gradient Descent vs. Stochastic Gradient Descent. Source.

Stochastic Gradient Descent is when your Neural Network goes through the data one row at a time and calculates the actual output for each row. Right away, the Neural Network compares the actual output of the first row to the expected output and back-propagates to update the weights; that completes one update step. Then the same happens for the second row, comparing outputs and back-propagating to update the weights, and so on all the way to the last row. So many weight updates happen within a single pass through the dataset, rather than treating it as one big batch as in Gradient Descent. This helps avoid local minima, and it's faster than Gradient Descent because it doesn't need to load all the data into memory and run through it at once; instead it loads one row at a time and updates the weights.

There is a best-of-both-worlds method called mini-batch Gradient Descent, which basically combines the two: you decide how many rows to run through before each update. So instead of running the whole dataset as one batch, or running one row at a time, you have the flexibility to choose any number of rows per batch.
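
Here is a rough sketch of all three flavors on the same toy problem; the only difference between them is the batch size, i.e., how many rows are processed before each weight update. The data and learning rate are made up for illustration.

```python
# batch_size = len(X) -> (batch) Gradient Descent: one update per pass.
# batch_size = 1      -> Stochastic Gradient Descent: one update per row.
# anything in between -> mini-batch Gradient Descent.
import numpy as np

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = 2 * X                                     # toy target: y = 2x
w = np.zeros((1, 1))
learning_rate = 0.02

def train(w, batch_size, epochs=50):
    for epoch in range(epochs):               # one epoch = one pass over the data
        for start in range(0, len(X), batch_size):
            xb = X[start:start + batch_size]  # the rows in this batch
            yb = y[start:start + batch_size]
            grad = 2 * xb.T @ (xb @ w - yb) / len(xb)
            w = w - learning_rate * grad      # one weight update per batch
    return w

print(train(w, batch_size=len(X)))  # batch Gradient Descent: 1 update per epoch
print(train(w, batch_size=1))       # Stochastic: 6 updates per epoch
print(train(w, batch_size=2))       # mini-batch: 3 updates per epoch
```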

Back-Propagation

By now you should know what back-propagation is. If you don't, it's simply the process of adjusting the weights of all the Neurons in your Neural Network after calculating the Cost Function. Back-Propagation is how your Neural Network learns: the error from the Cost Function is propagated backwards through the network. The important concept to know is that Back-Propagation updates the weights of all the Neurons simultaneously.

For training purposes, the weights of the Neurons are initially set to small random numbers; then, through learning and back-propagation, the weights get updated with meaningful values.
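
Putting it together, here is a small NumPy sketch of a two-layer network: the weights start as small random numbers, and each epoch back-propagates the error and updates both weight matrices in the same step. The data, layer sizes, and learning rate are made up for illustration.

```python
# Forward pass, cost, then back-propagation: walk backwards through the layers,
# compute each layer's gradient, and update all the weights together.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                   # 8 made-up rows, 3 features
y = (X.sum(axis=1, keepdims=True) > 0) * 1.0  # toy labels

W1 = rng.normal(scale=0.1, size=(3, 4))       # small random initial weights
W2 = rng.normal(scale=0.1, size=(4, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

learning_rate = 0.5
for epoch in range(500):
    # forward pass
    hidden = sigmoid(X @ W1)
    output = sigmoid(hidden @ W2)
    cost = np.mean((output - y) ** 2)

    # back-propagation: chain rule, output layer first, then hidden layer
    d_output = (output - y) * output * (1 - output)
    d_hidden = (d_output @ W2.T) * hidden * (1 - hidden)

    # both weight matrices are updated together in the same step
    W2 -= learning_rate * hidden.T @ d_output / len(X)
    W1 -= learning_rate * X.T @ d_hidden / len(X)

print(cost)  # should be noticeably smaller than at the start
```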