Analyzing the Mathematical Ideas Behind Deep Learning

The original article was published by Md Ashikquer Rahman on Deep Learning on Medium.


A deep neural network (DNN) is essentially formed by connecting multiple perceptrons, where a perceptron is a single neuron. Think of an artificial neural network (ANN) as a system containing a set of inputs fed along weighted paths. It processes these inputs and produces an output to perform a given task. Over time, the ANN "learns" by adjusting these paths: paths found to be more important (those that produce better results) are assigned higher weights, while paths that produce poorer results are assigned lower weights.
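
To make this concrete, here is a minimal sketch of a single perceptron in Python; the inputs, weights, and sigmoid activation here are illustrative choices, not taken from any trained model:

```python
import numpy as np

def perceptron(x, w, b):
    """A single neuron: weighted sum of inputs plus a bias,
    squashed through a sigmoid activation into (0, 1)."""
    z = np.dot(w, x) + b              # weighted path: sum of input * weight
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid activation

x = np.array([0.5, 0.3, 0.2])    # example inputs (hypothetical)
w = np.array([0.4, -0.6, 0.9])   # example weights, adjusted during learning
b = 0.1                          # bias term
print(perceptron(x, w, b))       # a value between 0 and 1
```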

In a DNN, if all inputs are densely connected to all outputs, the layers are called dense layers. A DNN can also contain multiple hidden layers. A hidden layer sits between the input and output of the neural network, where an activation function transforms the incoming information. It is called a hidden layer because it cannot be directly observed from the inputs and outputs of the system. The deeper the neural network, the more the network can identify in the data, and the more information it can extract.
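
As an illustration, a small DNN with dense layers might be defined like this in TensorFlow 2.0 / Keras; the layer sizes here are arbitrary choices for the sketch, not prescribed by the article:

```python
import tensorflow as tf

# A small DNN: every input is densely connected to every unit in the
# hidden layers, which are "hidden" between the input and the output.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(3,)),  # hidden layer 1
    tf.keras.layers.Dense(16, activation="relu"),                    # hidden layer 2
    tf.keras.layers.Dense(1, activation="sigmoid"),                  # output layer
])
model.summary()
```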

However, although the goal is to learn as much as possible from the data, deep learning models can suffer from overfitting. This happens when the model learns too much from the training data, including its random noise. The model can then fit very complex patterns in the data, but this hurts its performance on new data: the noise picked up during training does not carry over to new or unseen data, so the model fails to generalize the patterns it found. Non-linearity is also essential in deep learning models. Although a model with multiple hidden layers can learn a great deal, applying a purely linear form to a non-linear problem degrades performance.
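
To see why non-linearity matters: without activation functions, stacked layers collapse into a single linear map, as this small NumPy check illustrates (the matrices are random, for demonstration only):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # first "layer" weights
W2 = rng.normal(size=(2, 4))   # second "layer" weights
x = rng.normal(size=3)

# Two stacked linear layers...
deep_linear = W2 @ (W1 @ x)
# ...are equivalent to one linear layer with weights W2 @ W1.
single_linear = (W2 @ W1) @ x
print(np.allclose(deep_linear, single_linear))  # True: no extra expressive power
```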

The question now is, "How do these layers learn things?" Here we can apply an ANN to a real scenario to solve a problem and understand how the model is trained to achieve its goal. The case study is as follows:

In the current pandemic, many schools have transitioned to virtual learning, which makes some students worry about their chances of passing their courses. "Will I pass this course?" is a question that any artificial intelligence system should be able to answer.

For simplicity, suppose the model has only 3 inputs: the number of lectures a student attended, the time spent on homework, and the number of times the internet connection dropped during lectures. The output is a binary classification: the student either passes the course or fails, encoded as 1 and 0. At the end of the semester, Student A has attended 21 lectures, spent 90 hours on homework, and lost the internet connection 7 times. These inputs are fed into the model, and the output predicts that Student A has a 5% chance of passing the course. A week later, the final grades are released, and Student A passes the course. So, what went wrong with the model's prediction?
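
A hedged sketch of this scenario, with made-up, untrained weights (any logistic model would do; nothing here comes from the article itself):

```python
import numpy as np

def predict_pass_probability(x, w, b):
    """Logistic model: inputs -> probability of passing the course."""
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))

# Student A: 21 lectures attended, 90 hours of homework, 7 connection drops.
student_a = np.array([21.0, 90.0, 7.0])

# Arbitrary initial weights -- the model has not learned anything yet,
# so its prediction can be badly wrong, much like the 5% in the story.
w = np.array([0.05, -0.02, -0.2])
b = -0.5
print(predict_pass_probability(student_a, w, b))  # ~0.07: very low chance
```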

Technically speaking, nothing. The model worked exactly as it was built. The problem is that the model does not yet know anything: we initialized some weights on its paths, but the model has no idea which weights are right or wrong, so the weights are simply incorrect. This is where learning comes in. The idea is that the model needs a way to know when it is wrong, and we provide one by calculating some form of "loss". The loss calculated depends on the problem at hand, but it usually involves minimizing the difference between the predicted output and the actual output.

In the situation above, there is only one student and one error to minimize. However, this is usually not the case. When minimizing over multiple students and multiple differences, the total loss is typically calculated as the average of the differences between all predicted and actual observations.
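
In symbols, with $n$ students, predictions $\hat{y}_i$, actual outcomes $y_i$, and a per-example loss $\mathcal{L}$, the total loss is the average of the individual losses:

$$J(W) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left(\hat{y}_i,\, y_i\right)$$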

Recall that the loss being calculated depends on the problem at hand. Since our current problem is binary classification (classifying 0 and 1), the appropriate loss is the cross-entropy loss. The idea behind this function is that it compares the predicted distribution over whether a student will pass the course with the actual distribution, and tries to minimize the difference between these distributions.
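
A minimal NumPy version of binary cross-entropy, assuming labels in {0, 1} and predicted probabilities of passing:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy between actual labels and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])     # pass, fail, pass
y_pred = np.array([0.05, 0.10, 0.90])  # predicted probabilities of passing
print(binary_cross_entropy(y_true, y_pred))  # the 0.05 contributes a large loss
```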

Suppose instead that we no longer want to predict whether a student will pass the course, but rather the grade they will earn. Cross-entropy loss is then no longer suitable; the mean squared error loss is more appropriate, since that method fits regression problems. The idea is to minimize the squared difference between the actual value and the predicted value.
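
The corresponding sketch for regression, again in NumPy (the grades are made up for illustration):

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Mean of squared differences between actual and predicted grades."""
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([82.0, 67.0, 91.0])  # actual course grades (hypothetical)
y_pred = np.array([75.0, 70.0, 88.0])  # predicted grades
print(mean_squared_error(y_true, y_pred))
```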

Now that we understand some loss functions (for an introduction to loss functions, see Deep Learning Basics: Mathematical Analysis Basics and TensorFlow 2.0 Regression Models), we can turn to loss optimization and model training. The key to a good DNN is having the appropriate weights, so loss optimization tries to find a set of weights W that minimizes the calculated loss. If there were only one weight component, you could plot weight against loss on a two-dimensional graph and pick the weight that minimizes the loss. However, most DNNs have many weight components, and visualizing an n-dimensional graph is very difficult.
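
In symbols, loss optimization seeks the weights

$$W^{*} = \underset{W}{\arg\min}\; J(W)$$

where $J(W)$ is the average loss defined earlier.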

Instead, the derivative of the loss function is calculated with respect to all the weights to determine the direction of steepest ascent. The model then knows which way is uphill and which is downhill, and moves downhill until it reaches a convergence point at a local minimum. Once this descent is complete, a set of optimal weights is returned, and these are the weights the DNN should use (assuming the model is well developed).
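
A bare-bones gradient descent loop on a toy one-weight loss makes the "move downhill until convergence" idea concrete; the quadratic loss here is purely illustrative:

```python
# Gradient descent on a toy loss J(w) = (w - 3)^2, whose minimum is at w = 3.
def grad(w):
    return 2 * (w - 3)   # derivative of the loss with respect to the weight

w = 0.0    # initial weight
lr = 0.1   # learning rate (step size)
for step in range(100):
    w -= lr * grad(w)    # step opposite the gradient: steepest descent
print(w)                 # converges toward 3.0
```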

The process of calculating these derivatives is called backpropagation, and it is essentially the chain rule from calculus. Consider the neural network described above: how does a small change in the first set of weights affect the final loss? That is what the derivative, or gradient, expresses. The first set of weights, however, feeds into a hidden layer, which has another set of weights leading to the predicted output and the loss, so the effect of the weight change on the hidden layer must also be considered. These are the only two parts of this network, but if there were more weights to consider, the process would continue by applying the chain rule from the output back to the input.
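
A compact sketch of backpropagation through one hidden layer, applying the chain rule from the output back to the input (the network sizes and data are illustrative, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3))     # one example with 3 inputs
y = np.array([[1.0]])           # its actual outcome
W1 = rng.normal(size=(3, 4))    # input -> hidden weights
W2 = rng.normal(size=(4, 1))    # hidden -> output weights

# Forward pass: input -> hidden layer (sigmoid) -> output -> loss.
h = 1 / (1 + np.exp(-(x @ W1)))
y_hat = h @ W2
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass: chain rule, from the loss back to each set of weights.
d_yhat = y_hat - y                  # dLoss/dy_hat
dW2 = h.T @ d_yhat                  # dLoss/dW2
d_h = d_yhat @ W2.T                 # dLoss/dh: effect routed through W2
dW1 = x.T @ (d_h * h * (1 - h))     # dLoss/dW1, via the sigmoid's derivative
print(loss, dW1.shape, dW2.shape)
```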

Another important factor when training a DNN is the learning rate (which can be viewed mathematically as a convergence factor). As the model searches for the best set of weights, it must update them by some factor at each step. Although this may seem trivial, choosing that factor is essential. If the factor is too small, the model can take an extremely long time to run, or settle in a spot that is not the global minimum. If the factor is too large, the model may overshoot the target point entirely and diverge.
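
In symbols, each step updates the weights as

$$W \leftarrow W - \eta \, \frac{\partial J}{\partial W}$$

where the learning rate $\eta$ is exactly the factor in question.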

Although a well-chosen fixed rate may seem ideal, an adaptive learning rate reduces the chance of the problems described above. In other words, the factor changes according to the current gradient, the size of the current weights, or other quantities that affect the model's next step toward the best weights.
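
One simple adaptive scheme (an Adagrad-style rule, used here only as an illustration, not as the article's prescription) scales the step by the accumulated gradient magnitude:

```python
import numpy as np

# Adagrad-style update on the same toy loss J(w) = (w - 3)^2.
def grad(w):
    return 2 * (w - 3)

w, lr, g_accum = 0.0, 1.0, 0.0
for step in range(200):
    g = grad(w)
    g_accum += g ** 2                        # accumulate squared gradients
    w -= lr * g / (np.sqrt(g_accum) + 1e-8)  # effective rate shrinks over time
print(w)  # approaches 3.0 with a self-adjusting step size
```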

As we have seen, DNNs are built on calculus and some statistics. Examining the mathematical ideas behind these deep learning processes is useful because it helps people understand what is really happening inside the model, which can lead to the development of better models overall.