Source: Deep Learning on Medium
What are Neural Networks?
Neural networks are algorithms (modeled after the human brain) used to recognize patterns in a data set. They take input data, process the data through the hidden layers, and return output. In other words, they map inputs to outputs, and learn to approximate a function between any input x and any output y. Additionally, they can be trained on both labeled and unlabeled data; in fact, one of the main advantages of neural networks and other deep learning models is that they do not need labeled data, and thus can perform unsupervised learning. Unsupervised learning is essentially the process of a model grouping similar objects together, but not knowing what those objects are. On the other hand, a model does supervised learning when it is given data and labels corresponding to that data, and then learns correlations between the data and the labels.
Neural networks can be highly efficient for classification (a form of supervised learning) and clustering (a form of unsupervised learning) tasks. For example, a neural network can be trained to classify images of dogs and cats (specifically convolutional neural networks). Each image in the training data set is represented as n × n pixels. Each pixel can then be represented as a number corresponding to the color/darkness of that pixel (e.g. 0 to 255 for a color pixel or 0 to 1 for a grayscale pixel). As a result, the neural network will have a total of n^2 features as inputs. We will also need to make sure that each image is labeled with a category (e.g. whether it is a cat or a dog; that way, the network can learn associations between the images and the labels).
However, neural networks can also be used for other, typical classification problems (e.g. predicting 0 or 1). In such cases, the number of features in the training data would not be pixels, but simply the number of columns in the data set (assuming that categorical data has been encoded, feature extraction has been carried out, etc.).
A neural network consists of fundamental building blocks called layers, which work together to analyze increasingly broad features in the training data, allowing the model to learn patterns. In the case of image classification, the first layer is responsible for reshaping the image data from an n × n array to a n^2 × 1 array. In other words, the first layer takes rows of pixels in the data and puts them in a long line. But in the case of typical numerical classification, the information in each row of the training data set is put into an n × 1 array, where n is the number of features.
After adding the first layer, we will want to add layers that are fully-connected or densely-connected. These layers will analyze different types of patterns in the training data. Each node in these layers multiply the input values by particular numbers called weights, sums it with other values coming into that node, adjusts the result by the neuron’s bias, and then normalizes the output with an activation function. The last layer will contain the model’s output. The size of the output array is equal to the number of categories a data point could be classified as. If performing binary classification (1 or 0 / True or False), the output layer will be a 2 × 1 array. If classifying images of humans as infants, children, adults, and the elderly, the output will have dimensions 4 × 1, and so on.
The output contains the probabilities that a data point belongs to a particular category. For example, when shown a cat, a neural network may calculate that the image has a 96% probability of being a cat and 4% probability of being a dog. If shown an infant, the network may predict that the image has a 94% probability of being an infant, 3% a child, 2% an adult, and 1% an old person. If a neural network is predicting fraud, it may predict that a particular financial transaction has a 92% probability of being fraudulent, and an 8% probability of being non-fraudulent.
Additionally, it is important to note that a neural network is considered a deep learning model only if it has more than one hidden layer. Neural networks with multiple hidden layers are able to learn much more sophisticated trends and patterns in a data set than those with one hidden layer. In other words, multiple hidden layers allow a neural network to delve deeper into the data set to find patterns, and thus such models are categorized as deep learning models.
Training Neural Networks: Explanation
In order to train a neural network, the data must be fed to it in the proper format. Arrays may be the most appropriate, but other formats like data frames may work as well. Once the data is fed, the neural network is trained via a process called iterative learning. Before the training process is started, all of the weights associated with the input values are randomly initialized. Each row in the data set is presented to the network one at a time; the network makes a prediction for that row and then compares its prediction with the actual result in the row. The error between the actual and predicted value is backpropagated through the network, and the weights are adjusted accordingly. Then the next row is presented to the neural network, and this process is performed on the entire data set.
Model training is initiated by calling the .fit() function in Python’s keras module. It takes the data points, the labels corresponding to those data points (whether an image is a of a cat or a dog, whether a transaction is fraudulent or not [1 or 0], etc.), and epochs as arguments. Epochs is the number of times to train the model on the entire training data. For example, suppose that a data set contains 10,000 rows. If a model is trained on that data set, the weights will be adjusted 10,000 times. If the number of epochs is 5, the model will be trained on the entire data set 5 times, and therefore the weights will be adjusted a total of 10,000 * 5 = 50,000 times. Epochs are useful as they allow for a neural network to be refined and trained as much as possible to increase its accuracy and predictive power. Training a neural network multiple times on a data set also works toward minimizing the model’s loss (a measure of error between the actual and predicted values) via gradient descent (an algorithm that refines the weights to approach the global minimum of the loss function) and maximizing the model’s accuracy (the ratio of the number of correct predictions to the number of total predictions). A neural network’s iterative learning process and improvement in classification can be illustrated by the graphs below:
Training Neural Networks: Problems
One may think that by adding more and more layers to a neural network, it should be able to make better predictions because it is performing increasingly sophisticated analysis on the features of the data set. One may also think that in the worst-case scenario, redundant layers in a neural network should simply do nothing — neither improve nor decrease the accuracy. After all, this makes intuitive sense. Unfortunately, after a certain point, adding additional hidden layers to a neural network only marginally improves the model’s predictive power. In more severe situations, it can even decrease the accuracy.
In fact, as more layers are added, there is a pattern: earlier layers learn much slower than later layers. This problem is called the vanishing gradient problem. The opposite phenomenon can occur as well, but similar to its counterpart, it is not pleasant! Here, as more layers added to the neural network, the gradient gets larger for earlier layers. This problem is called the exploding gradient problem. Overall, these problems show that the gradient descent algorithm is not always the best technique for training a neural network, due to its instability across various layers.
A typical approach to gradient descent is to randomly initialize the weights, and use the algorithm to recursively approach the loss function’s global minimum. However, this random initialization is not always a good idea in regards to training deep learning models, and the way the weights are initialized can actually make a substantial difference in model training and reliability (Sutskever, Martens, Dahl and Hinton, 2013).
Another problem with neural networks is that they may not always be the most computationally efficient models: they are generally very expensive to train. As the number of features in a data set increases, the time it takes to train a neural network on that data set increases exponentially. Given that data sets in the real world could contain thousands or millions of features, using feature selection in deep learning is of paramount importance. Feature selection is the process of choosing the most statistically significant features in a data set for a model to be trained on (features that have the most influence on the output variable), and eliminating the remaining features, which may have little to no influence.
Overall, despite the possible difficulty in training them, neural networks are extremely powerful models that can be used to solve sophisticated classification problems (classifying both regular data and images). The layered nature of these deep learning models allows them to analyze increasingly complex patterns and features in the data set, which explains why they are used for analyzing highly sophisticated data such as images. Additionally, the manner in which their iterative learning process works further helps them truly understand the training data and then look for similar patterns in the testing data. The sophisticated nature of deep learning models explains how they are able to deal with very noisy data, which is an essential ability to have in order to truly understand chaotic data in the real world.