Deep learning (DL) has become a common term in discussions of analytics and business intelligence projects. It belongs to the broader field of artificial intelligence and, more specifically, is a subset of machine learning. Rather than relying on task-specific algorithms, these models learn patterns and representations directly from the given data (understanding the structure of the data rather than only fitting a line, hyperplane or decision boundary). Learning can be supervised, semi-supervised or unsupervised.
DL models play a vital role in computer vision, speech recognition, natural language processing, bioinformatics, drug design and machine translation (translation from one human language to another, e.g., English to Hindi), to list a few.
In simple terms, most deep learning models stack multiple layers of neural nets in a particular architectural layout to solve either a prediction or a classification problem (reinforcement and generative architectures deal with a different set of real-world problems). Neural nets are versatile, robust and scalable, and they can handle high-dimensional tasks, that is, tasks with a very large number of features. In object recognition, for example, where the goal is to identify whether an image contains a cat or a dog, each pixel colour channel is a feature: a 120×120 image has 14,400 pixels, and multiplying that by three for the RGB channel intensities gives 43,200 features to start with.
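The feature count for the 120×120 RGB example can be checked with a few lines of Python. This is only a minimal sketch; the function name `flattened_feature_count` is a hypothetical helper introduced for illustration.

```python
def flattened_feature_count(height: int, width: int, channels: int = 3) -> int:
    """Number of input features when every pixel channel counts as one feature."""
    return height * width * channels


pixels = 120 * 120                             # pixels in a 120x120 image
features = flattened_feature_count(120, 120)   # three RGB intensities per pixel
print(pixels, features)  # 14400 43200
```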
Before the rise of neural nets in the early 2010s, support vector machines (SVMs) played a significant role in high-dimensional predictive problems such as text classification and speech recognition.
In a traditional classification task (e.g., predicting whether a patient will be diagnosed with a disease based on a list of symptoms and family health records, where the output is a yes or no together with the probability of that outcome), the objective is to find the decision boundary that separates the categories of the target variable (disease state: yes or no). Logistic regression works well when the data is linearly separable but fails to capture non-linear relationships. An SVM employs the kernel trick and the maximal-margin concept to perform better on non-linear and high-dimensional tasks. Even a powerful SVM model, most of the time, benefits from proper feature selection and feature extraction or transformation.
The artificial neural net (ANN) concept is not new to the computer science world. It was first proposed by Warren McCulloch and Walter Pitts in 1943, and in 1957 Frank Rosenblatt, with backing from the United States Office of Naval Research, built the perceptron (neural net) algorithm. A single-layer perceptron did not live up to expectations because it can only capture linearly separable patterns, so it cannot represent even a simple XOR function. Stacking two or more neural layers (a feedforward neural net, or multilayer perceptron) removes this limitation, but at the time there was no widely known, practical way to train such stacked networks.
Marvin Minsky and Seymour Papert showed in their 1969 book Perceptrons that single-layer networks of this kind cannot model a simple XOR function. For many years the book's influence kept progress in the ANN area very limited. Only in the 1980s did the approach return to active research, and in 2012 a deep convolutional network from Geoffrey Hinton's group (Alex Krizhevsky, Ilya Sutskever and Hinton), trained with backpropagation on GPUs, won the ImageNet challenge by a wide margin, which revolutionized the field of deep learning.
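The XOR limitation applies to a single layer of threshold units; adding one hidden layer is enough to represent XOR. The sketch below illustrates this with hand-picked weights (an assumption made for illustration, not weights learned by any training algorithm), where one hidden unit behaves like OR and the other like AND.

```python
import numpy as np


def step(z):
    """Threshold unit: outputs 1 only when the weighted input exceeds 0."""
    return (z > 0).astype(int)


def xor_net(x):
    # Hidden layer: column 0 acts like OR, column 1 acts like AND.
    W1 = np.array([[1.0, 1.0],
                   [1.0, 1.0]])
    b1 = np.array([-0.5, -1.5])
    h = step(x @ W1 + b1)
    # Output unit: fires when OR is on and AND is off, which is XOR.
    W2 = np.array([1.0, -1.0])
    b2 = -0.5
    return step(h @ W2 + b2)


X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(xor_net(X))  # [0 1 1 0]
```

No single threshold unit can produce this output, because no straight line separates the inputs (0,1) and (1,0) from (0,0) and (1,1).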
Growth in DL usage should also be attributed to the enabling fields. Data processing saw groundbreaking changes around 2010: the Hadoop distributed ecosystem changed how data is stored and processed. The processing power of a single processor core has increased manifold compared with the processors of the 1980s, and the emergence of connected devices (the Internet of Things) made it possible to collect vast amounts of data, which provided the much-needed training data for neural nets. Graphics processing units (GPUs) perform matrix multiplication much faster than multi-core CPUs, and neural nets depend heavily on matrix operations for their calculations. Thanks go to gamers across the world: because of them, neural nets can now be trained much faster on GPUs. Without your relentless effort and resolve, there would be no better GPUs in this world.
The fundamental unit of a neural net is a single neuron, which is loosely modeled on the neurons in a biological brain. Each neuron in a given layer (e.g., layer 1) is connected to all, or many, of the neurons in the next layer (e.g., layer 2). The connections between neurons mimic the synapses in the biological brain. A neuron fires an output signal only if it has received enough input signal from its predecessors (enough in magnitude to cross a set threshold).
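The firing rule described above can be sketched as a weighted sum compared against a threshold. The weights, inputs and threshold below are made-up values chosen only for illustration, and `neuron_fires` is a hypothetical helper name.

```python
import numpy as np


def neuron_fires(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    total = float(np.dot(inputs, weights))
    return 1 if total >= threshold else 0


weights = np.array([0.6, 0.4, 0.2])
print(neuron_fires(np.array([1, 0, 0]), weights, threshold=0.7))  # 0: a sum of 0.6 stays below 0.7
print(neuron_fires(np.array([1, 1, 0]), weights, threshold=0.7))  # 1: a sum of 1.0 crosses 0.7
```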
Techniques that improved neural net performance over time and helped neural nets beat SVMs:
1. Backpropagation: A multilayer perceptron (MLP) has an input layer, one or more hidden layers and an output layer. Training an MLP was considered an insurmountable task until 1986, when David Rumelhart, Geoffrey Hinton and Ronald Williams published an article introducing the backpropagation training algorithm (essentially gradient descent using reverse-mode autodiff). For each training record (data point), the algorithm computes the output of the neurons layer by layer and finally makes a prediction at the output layer (the forward pass). It then calculates the prediction error from how far the prediction is from the actual output. The error is propagated backwards through the preceding layers until it reaches the input layer, and the weights in each layer are adjusted according to their contribution to the error (backpropagation), which improves the overall network's prediction accuracy.
2. Number of hidden layers and neurons per hidden layer: A single-layer neural net can give reasonable results, but stacking layers improves the learning capacity of the network. A multilayer neural net for face detection will generally outperform a single-layer neural net. When layers are stacked, the lower layers can capture low-level details (e.g., the lines separating the face from the background), the middle hidden layers can capture mid-level details (e.g., squares and circles) and the upper layers, up to the output layer, can detect high-level features (e.g., the pixel location of the eye). Adding more layers and more neurons per layer can lead to overfitting, longer training time and the vanishing/exploding gradients problem, so these hyperparameters require careful consideration.
3. Activation functions (vanishing and exploding gradients, non-saturating activation functions): An activation function decides when a neuron fires and the magnitude of its output based on the input signals from its predecessors. It can be a sigmoid, tanh, softmax or a ReLU variant. It is common to use ReLU (Rectified Linear Unit) as the activation function for the hidden layers. For the output layer, use softmax for a classification task, or no activation (the raw value) for a numeric prediction. When saturating functions such as sigmoid or tanh are used in a deep network, the backpropagation signal can diminish toward zero (or, in some cases, explode into a very large number) by the time it reaches the lower layers near the input; without a proper backpropagation signal, the weights in the lower layers barely change. ReLU does not saturate for positive inputs, but a ReLU neuron can "die" and output only zero. Variants of ReLU come to the rescue: leaky ReLU, randomized leaky ReLU, parametric leaky ReLU and the Exponential Linear Unit (ELU). Performance tests have shown the following order of preference.
4. Batch normalization: Sergey Ioffe and Christian Szegedy proposed batch normalization (BN) in a 2015 paper to tackle the vanishing and exploding gradients problem. Just before the activation function of each layer, the inputs are zero-centered and normalized, then scaled and shifted using two new learned parameters (one for scaling, the other for shifting). This lets the model learn the optimal scale and mean of the inputs to each layer.
5. Reusing pre-trained layers (Transfer Learning): The lower-layer weights of a pre-trained model can be reused instead of training a new model from scratch. If we are building a model to identify a dog’s breed, we can reuse the lower-layer weights of an existing model that determines whether the animal in an image is a dog or not.
6. Faster optimizers: Optimizers use the gradients computed by backpropagation to adjust the neuron weights across all layers. The performance and speed of the optimizer have a direct impact on the training speed of the net. Momentum optimization, proposed by Boris Polyak in 1964, was the forefather of these faster optimizers. Later came Nesterov Accelerated Gradient, AdaGrad, RMSProp and Adam. Adam often performs well in practice and is a common default choice.
7. Learning rate scheduling: It is critical to find the right learning rate. A learning rate that is too small takes very long to reach the optimum, and one that is too large overshoots and swings back and forth across the optimum instead of settling on it. Instead of a constant learning rate, it is often recommended to start with a high learning rate and reduce it during training. Adaptive optimizers such as AdaGrad, RMSProp and Adam typically take care of much of this for the user by adjusting the effective step size during training.
8. Early stopping and l1 and l2 regularization: Stop training the network when its performance on a validation set starts to drop compared to previous epochs. Regularizing the neuron weights (not the biases) using the l1 or l2 norm helps keep the network from overfitting the training data.
9. Dropout: This concept was proposed by Geoffrey Hinton in 2012, and it helps prevent networks from overfitting. At every training iteration, each neuron in every layer (including the input layer but excluding the output layer) has a probability p of being temporarily dropped out of the network. In effect, a different architecture is trained in each iteration, which improves model accuracy without overfitting the training data.
10. Data augmentation: Labeled data is more valuable than any precious metal in DL land. Every network requires a significant amount of labeled data for training (e.g., for cat vs. dog object detection we need labeled training images, where each image has been tagged as either cat or dog by a human). However, once we have some labeled training data, it is possible to apply modifications to the labeled data points to generate more labeled training data (e.g., by rotating a cat image by an angle or changing the intensity of a few pixels).
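The forward and backward passes from item 1 (backpropagation) can be sketched in NumPy on the XOR data. This is a minimal illustration under stated assumptions rather than a reference implementation: the layer size, learning rate, epoch count, sigmoid activations and squared-error loss are all choices made for this example, and `train_mlp` is a hypothetical helper name.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_mlp(X, y, hidden=4, lr=0.5, epochs=5000, seed=0):
    """Train a two-layer MLP with full-batch gradient descent; return the loss history."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 1.0, (X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 1.0, (hidden, 1))
    b2 = np.zeros(1)
    losses = []
    for _ in range(epochs):
        # Forward pass: compute each layer's output, then the prediction.
        h = sigmoid(X @ W1 + b1)
        y_hat = sigmoid(h @ W2 + b2)
        err = y_hat - y
        losses.append(float(np.mean(err ** 2)))
        # Backward pass: apply the chain rule from the output layer to the hidden layer.
        # (The constant factor 2/N of the mean squared error is folded into lr.)
        d_out = err * y_hat * (1.0 - y_hat)
        d_hid = (d_out @ W2.T) * h * (1.0 - h)
        # Gradient descent step on every weight and bias.
        W2 -= lr * (h.T @ d_out)
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * (X.T @ d_hid)
        b1 -= lr * d_hid.sum(axis=0)
    return losses


X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])
losses = train_mlp(X, y)
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The printed loss should be lower at the end of training than at the start; how close it gets to zero depends on the random initialization.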
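The ReLU family from item 3 (activation functions) is easy to write down. The sketch below assumes commonly used default slopes (`alpha=0.01` for leaky ReLU, `alpha=1.0` for ELU); those values are assumptions for illustration, not figures from the text above.

```python
import numpy as np


def relu(z):
    # Passes positive inputs through and outputs zero for everything else.
    return np.maximum(0.0, z)


def leaky_relu(z, alpha=0.01):
    # A small slope alpha for negative inputs keeps some gradient flowing.
    return np.where(z > 0, z, alpha * z)


def elu(z, alpha=1.0):
    # Smooth exponential curve for negative inputs that levels off at -alpha.
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))


z = np.array([-2.0, 0.0, 3.0])
print(relu(z))        # the negative input becomes 0
print(leaky_relu(z))  # the negative input is scaled by alpha (-0.02 here)
print(elu(z))         # the negative input maps to exp(-2) - 1
```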
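The normalize-then-scale-and-shift step from item 4 (batch normalization) can be sketched as follows. This covers only the training-time forward pass over one batch; the names `gamma`, `beta` and `eps` are conventional but are assumptions here, and at inference time implementations typically use running averages instead of batch statistics.

```python
import numpy as np


def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale by gamma and shift by beta."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero-centered, roughly unit variance
    return gamma * x_hat + beta              # the two learned parameters


rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(64, 4))  # a batch of 64 records with 4 features
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))
```

With `gamma` set to ones and `beta` to zeros, each output feature has a mean close to 0 and a standard deviation close to 1.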
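One simple way to start with a high learning rate and reduce it during training, as item 7 (learning rate scheduling) suggests, is exponential decay. The sketch below is one possible schedule among many; the function name and the parameter values are assumptions chosen for illustration.

```python
def exponential_decay(initial_lr: float, decay_rate: float, decay_steps: int, step: int) -> float:
    """Learning rate after `step` updates: initial_lr * decay_rate ** (step / decay_steps)."""
    return initial_lr * decay_rate ** (step / decay_steps)


# With decay_rate=0.5 the learning rate halves every 1000 steps.
for step in (0, 1000, 2000):
    print(step, exponential_decay(0.1, 0.5, 1000, step))
```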
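The early stopping rule from item 8 can be sketched as a scan over per-epoch validation losses. The `patience` parameter (how many epochs without improvement to tolerate) is a common convention assumed here, and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the best epoch seen before training would be stopped."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch


val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.50]
print(early_stopping_epoch(val_losses))  # 2
```

With `patience=2`, training stops at epoch 4 and never reaches the later improvement at epoch 5, which shows the trade-off involved in choosing the patience value.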
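The random dropping of neurons from item 9 (dropout) can be sketched with a random mask. Note one swapped-in detail: the code rescales the surviving activations by 1/(1-p) during training (often called inverted dropout), which is a common variant and an assumption here, not something stated in the text above.

```python
import numpy as np


def dropout(activations, p, rng):
    """Zero each unit with probability p and rescale the survivors by 1 / (1 - p)."""
    keep_mask = rng.random(activations.shape) >= p
    return activations * keep_mask / (1.0 - p)


rng = np.random.default_rng(0)
a = np.ones((2, 8))  # stand-in activations for a batch of 2 records
dropped = dropout(a, p=0.5, rng=rng)
print(dropped)  # every entry is either 0.0 (dropped) or 2.0 (kept and rescaled)
```

Dropout is applied only during training; at prediction time all neurons stay active.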
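The idea in item 10 (data augmentation) can be sketched with plain NumPy array operations. The specific transformations (mirror, 90-degree rotation, small pixel noise) and the random stand-in image are assumptions chosen for illustration; real pipelines usually rely on an image library and a wider range of transformations.

```python
import numpy as np


def augment(image, rng):
    """Return simple label-preserving variants of an H x W x C image array."""
    flipped = image[:, ::-1, :]                     # horizontal mirror
    rotated = np.rot90(image, k=1, axes=(0, 1))     # rotation by 90 degrees
    noise = rng.normal(0.0, 5.0, size=image.shape)  # small pixel-intensity jitter
    noisy = np.clip(image + noise, 0, 255)
    return [flipped, rotated, noisy]


rng = np.random.default_rng(0)
cat = rng.integers(0, 256, size=(120, 120, 3)).astype(np.float64)  # stand-in for a labeled cat image
variants = augment(cat, rng)
labels = ["cat"] * len(variants)  # every variant keeps the original label
print(len(variants), variants[0].shape)  # 3 (120, 120, 3)
```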
Research article trend from academic.microsoft.com to identify the leading algorithm:
Let’s look at the trend in published articles on neural nets vs. support vector machines starting from 2000. There is a significant uptick in article volume for neural nets, and they have significantly surpassed SVMs in active research over the last seven years.
I hope the scuffle between machine learning algorithms leads to better and more intelligent products that serve human endeavours.
Source: Deep Learning on Medium