All thanks to Daniel Smilkov and Shan Carter, who created an educational visualization at https://playground.tensorflow.org so that a newcomer can quickly understand and clarify their concepts about deep learning. In this article, I will use the TensorFlow playground to simulate the impact of changing neural network hyperparameters. It will help us strengthen our deep learning concepts.
First, we will start by understanding some of the terms, following the numbers from 1 to 8 depicted in the picture below.
1- Data
We have six different data sets: Circle, Exclusive OR (XOR), Gaussian, Spiral, Plane and Multi Gaussian. The first four are for classification problems and the last two are for regression problems. The small circles are the data points, each labeled positive one or negative one. In general, positive values are shown in blue and negative values in orange.
In the hidden layers, the lines are colored by the weights of the connections between neurons. Blue shows a positive weight, which means the network is using that output of the neuron as given. An orange line shows that the network is assigning a negative weight.
In the output layer, the dots are colored orange or blue depending on their original values. The background color shows what the network is predicting for a particular area. The intensity of the color shows how confident that prediction is.
2- Features
We have seven features or inputs (X1, X2, X1², X2², X1X2, sin(X1) and sin(X2)). We can turn different features on and off to see which features are more important. It is a good example of feature engineering.
3- Epoch
An epoch is one complete pass through the entire training data set.
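To make the term concrete, here is a minimal sketch (with made-up numbers, not tied to the playground) of how epochs relate to batches and gradient updates:

```python
# A sketch: one epoch = one full pass over the training set.
# The dataset, batch size and epoch count below are made up for illustration.
dataset = list(range(8))   # 8 made-up training examples
batch_size = 4
epochs = 3

updates = 0
for epoch in range(epochs):
    # one epoch: visit every example exactly once, in batches
    for i in range(0, len(dataset), batch_size):
        batch = dataset[i:i + batch_size]
        updates += 1       # one gradient update per batch

print(updates)  # 3 epochs * 2 batches per epoch = 6 updates
```

So "100 epochs" in the playground means the network has seen every training point one hundred times.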
4- Learning Rate
The learning rate (alpha) is the hyperparameter that controls how large a step the optimizer takes toward the (local) optimum on each update.
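A minimal sketch of what the learning rate does, using a made-up one-dimensional loss f(w) = (w - 3)² rather than a real network:

```python
# Hypothetical example: minimize f(w) = (w - 3)^2 with gradient descent.
# The learning rate (alpha) scales every update step; too small converges
# slowly, too large can overshoot the minimum and diverge.

def gradient_descent(alpha, steps=100, w=0.0):
    """Minimize f(w) = (w - 3)^2; the gradient is 2 * (w - 3)."""
    for _ in range(steps):
        grad = 2 * (w - 3)
        w = w - alpha * grad   # the update rule: w := w - alpha * df/dw
    return w

print(gradient_descent(alpha=0.03))  # approaches the optimum at w = 3
```

With alpha = 0.03 (the value used later in this article) the iterate lands very close to 3 after 100 steps; an alpha above 1.0 would make this particular example oscillate and blow up.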
5- Activation Function
An activation function is a function applied between two layers or at the output layer of a neural network. It is also known as the transfer function. The activation function determines what causes the neuron to fire. Activation functions are basically of two types: linear and non-linear.
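The four activation choices in the playground's dropdown can be sketched in a few lines of plain Python:

```python
import math

# The four activation functions offered by the playground, as plain functions.
def relu(z):    return max(0.0, z)                 # non-linear
def sigmoid(z): return 1.0 / (1.0 + math.exp(-z))  # non-linear, output in (0, 1)
def tanh(z):    return math.tanh(z)                # non-linear, output in (-1, 1)
def linear(z):  return z                           # identity: the only linear one

print(relu(-2.0), sigmoid(0.0), tanh(0.0), linear(5.0))  # 0.0 0.5 0.0 5.0
```

Note how ReLU clips negatives to zero while sigmoid and tanh squash their input into a bounded range; these shapes drive the behavior we will see in the experiments below.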
6- Regularization
The purpose of L1 and L2 regularization is to reduce overfitting. Overfitting occurs when a model works very well on the data it was trained on but makes poor predictions on data it has not seen before.
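Both methods work by adding a penalty on the weights to the loss, which pushes weights toward zero. A minimal sketch, with made-up weights and a regularization rate `lam`:

```python
# Sketch of the L1 and L2 penalty terms that get added to the training loss.
# `lam` (the regularization rate) and the weights below are made up.

def l1_penalty(weights, lam):
    # L1: sum of absolute values; tends to drive weights exactly to zero
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    # L2: sum of squares; shrinks all weights smoothly toward zero
    return lam * sum(w * w for w in weights)

weights = [0.5, -1.5, 2.0]
print(l1_penalty(weights, lam=0.1))  # 0.1 * (0.5 + 1.5 + 2.0)  = 0.4
print(l2_penalty(weights, lam=0.1))  # 0.1 * (0.25 + 2.25 + 4.0) = 0.65
```

Large weights make the penalty large, so minimizing loss + penalty discourages the over-complex fits that cause overfitting.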
7- Neural Network Model or Perceptron
A neural network model is a network of simple elements called neurons, which receive input, change their internal state (activation) according to that input, and produce an output depending on the input and activation. The simplest neural network, called a shallow neural network, has one input layer, one output layer and at least one hidden layer. When there are three or more hidden layers, we call it a deep neural network. Each hidden layer is made up of neurons that take input from the features or from predecessor neurons, compute a linear function (z) and then apply an activation function to produce an output (a).
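The z and a computed by a single neuron can be sketched directly (the inputs, weights and bias below are made-up numbers):

```python
# One neuron, as described above: a weighted sum z, then an activation a.
# Inputs, weights and bias are made up for illustration.

def neuron(inputs, weights, bias):
    z = sum(x * w for x, w in zip(inputs, weights)) + bias  # linear part: z = w.x + b
    a = max(0.0, z)                                         # ReLU activation: a = max(0, z)
    return z, a

z, a = neuron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.1)
print(z, a)  # z = 1.0*0.5 + 2.0*(-0.25) + 0.1 = 0.1, and a = max(0, 0.1) = 0.1
```

A whole hidden layer is just many such neurons sharing the same inputs, and a network is layers of them feeding each other.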
For details check: https://engmrk.com/module-14-artificial-neural-network/
8- Problem Type
We have four data sets for classification and two for the regression problem. We can select the type of problem we want to study.
A- Why do we increase the number of neurons in the hidden layer?
We will start with the basic model (a shallow neural network) with a single neuron in the hidden layer. Let's pick the 'Circle' dataset, features 'X1' and 'X2', a learning rate of 0.03 and the 'ReLU' activation. We will press the run button, wait for the completion of one hundred epochs and then click pause.
The test and training loss is still more than 0.4 even after 100 epochs. Now we will add four more neurons to our hidden layer using the plus button and run again.
Now our test and training loss is less than 0.02, and the output is very well classified into the two classes (orange and blue). Adding neurons to the hidden layer gives the network the flexibility to assign different weights and to compute in parallel. However, beyond a certain point, adding more neurons becomes computationally expensive with little benefit.
B- Why do we use a non-linear activation function for classification problems?
In a neural network, we use only non-linear activation functions for classification problems because our output label lies between 0 and 1, whereas a linear activation function can output any number between minus infinity and infinity. As a result, the output will never converge.
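There is a deeper reason as well: stacking linear layers buys nothing, because the composition of linear functions is still linear, and a single linear function can never carve out a region like the Circle dataset's inner class. A one-dimensional sketch:

```python
# Two stacked linear "layers" collapse into a single linear function,
# so extra linear layers add no modeling power. Coefficients are made up.

def layer(w, b):
    return lambda x: w * x + b

f = layer(2.0, 1.0)    # first layer with linear activation
g = layer(3.0, -1.0)   # second layer with linear activation

# g(f(x)) = 3 * (2x + 1) - 1 = 6x + 2 -- just one linear function again.
combined = layer(6.0, 2.0)

for x in [-1.0, 0.0, 2.5]:
    assert g(f(x)) == combined(x)
print("two linear layers == one linear layer")
```

Any non-linearity (ReLU, sigmoid, tanh) between the layers breaks this collapse, which is what lets depth create curved decision boundaries.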
In the above video, we ran the same model but with linear activation, and it is not converging: the test and training loss is still more than 0.5 after 100 epochs.
Also, check: https://engmrk.com/activation-function-for-dnn/
C- Why do we increase the number of hidden layers?
Now, we will add one more hidden layer with two neurons and press the run button.
Our test and training loss dropped below 0.02 in only 50 epochs, almost half as many as the single-hidden-layer model needed. As with neurons, adding hidden layers is not a good choice in every case; sometimes it is computationally expensive without adding any benefit. This is very well illustrated in the below video, where we added six hidden layers with two neurons each. Even after 100 epochs, we could not achieve good results.
D- Why ReLU activation is a good choice for hidden layers?
The rectified linear unit (ReLU) is the preferred choice for all hidden layers because its derivative is 1 as long as z is positive and 0 when z is negative. In some cases, leaky ReLU can be used just to avoid exactly zero derivatives. On the other hand, both the sigmoid and tanh functions are less suitable for hidden layers because if z is very large or very small, the slope of the function becomes very small, which slows down gradient descent.
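The difference in slopes is easy to verify numerically. A small sketch comparing the two derivatives at a large z:

```python
import math

# Compare gradients: sigmoid saturates for large |z| (derivative near zero,
# the "vanishing gradient"), while ReLU's derivative stays 1 for any z > 0.

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)          # derivative of sigmoid: s * (1 - s)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0  # derivative of ReLU

print(sigmoid_grad(10.0))  # tiny (about 4.5e-05): learning nearly stalls
print(relu_grad(10.0))     # 1.0, no matter how large z grows
```

Because gradient descent multiplies these derivatives layer by layer, near-zero slopes in sigmoid or tanh hidden layers can make deep networks train very slowly, while ReLU keeps the gradient flowing.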
We will run the training with different activation functions (ReLU, sigmoid, tanh and linear) and we will see the impact.
In the above video, it is clear that ReLU outperforms all the other activation functions. Tanh performs well on our selected dataset, but not as efficiently as ReLU. This is why ReLU is so popular in deep learning.
E- What is the impact of adding/reducing or changing input features?
Not all available features are helpful for modeling the problem. In fact, using all of the features, or unrelated ones, can be computationally expensive and may hurt the final accuracy. In real-life applications, it may take a lot of trial and error to figure out which features are most useful for modeling the problem. We will demonstrate this by using different features in our model.
By changing the input features from linear (X1, X2) to squared (X1², X2²), we achieved a loss of less than 0.01 in only 40 epochs. On the other hand, the product and sine features were not really helpful. The process of selecting the best input features is called feature engineering.
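The squared features work so well on the Circle dataset because of a simple change of coordinates: in (X1, X2) the two rings are not linearly separable, but in (X1², X2²) the boundary X1² + X2² = r² becomes a straight line. A sketch with made-up radii (the real playground data is noisier):

```python
import math
import random

# Why squared features help on the Circle dataset: sample two concentric
# rings (radii are made up), then separate them with one threshold on
# x1^2 + x2^2, which is linear in the squared-feature space.

random.seed(0)

def circle_point(radius):
    angle = random.uniform(0, 2 * math.pi)
    return radius * math.cos(angle), radius * math.sin(angle)

inner = [circle_point(1.0) for _ in range(50)]   # blue class
outer = [circle_point(3.0) for _ in range(50)]   # orange class

def predict(x1, x2):
    # a single linear boundary in squared features: x1^2 + x2^2 < 4
    return "inner" if x1 ** 2 + x2 ** 2 < 4.0 else "outer"

assert all(predict(x, y) == "inner" for x, y in inner)
assert all(predict(x, y) == "outer" for x, y in outer)
print("linearly separable in squared features")
```

This is feature engineering in miniature: the right input representation lets even a tiny network (or a single linear unit) solve the problem.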
You can access this educational playground and experiment a little bit more with the data sets and the different functions. Once again, we are thankful to the authors and all contributors of this tool as they have open sourced it on GitHub with the hope that it can make neural networks a little more accessible and easier to learn.
This article was originally published at https://engmrk.com.