Source: Deep Learning on Medium

I have come across a wonderful book by Terrence Sejnowski called The Deep Learning Revolution. The book describes the early struggles of the field and its remarkable breakthrough at the beginning of the 21st century. More interestingly, Sejnowski draws analogies between deep learning and the neurophysiology of the brain.

People who are interested in deep learning know that there is a buzzword associated with the field of neural computation: Artificial Intelligence. Many prominent researchers, Michael Jordan among them, argue that the modern approach to implementing AI is far different from the original (and more philosophical) definition of artificial intelligence. Hence, many researchers think of deep learning as machine learning or applied computational statistics.

Terrence Sejnowski makes a lot of great comparisons based on research in neuroscience, neural computation, biology, physiology, and physics. It does seem that deep learning as a field is closely connected to at least the general concepts of brain functioning; however, it has a statistical foundation. Our brain doesn’t implement back-propagation as is (or at all), and it doesn’t even use real numbers to encode signals as we do in deep learning. The way a brain learns representations is still an area of active research, and it is the one that fascinates me the most.

Deep learning, in turn, is here to stay for a long time, at least as a very robust machine learning instrument. As you might know, neural networks work well for a wide variety of tasks where other machine learning algorithms fail. However, I was puzzled by the task of interpreting my networks’ predictions in my research. There are a few methods that try to explain some of the decision process underlying convolutional networks, e.g., class activation mapping. For the most part, though, neural networks remain black-box algorithms (good luck explaining to your client why their mortgage application was denied). Hence, a lot of research is being conducted on the theoretical foundations of deep learning. *This article is intended as a reminder of some of the theory behind neural networks, to inspire further self-study.*

*General Definition of a Neural Network*

An artificial neural network with at least one hidden layer is *a universal function approximator*, parametrized by a set of weights: given enough hidden units and a suitable non-linear activation, it can approximate any continuous function on a compact (closed and bounded) domain to arbitrary accuracy. The underlying graph of signal flow defines the differences between architectures such as Fully-Connected, Convolutional and Recurrent Networks, but most of them follow the same statistical learning rules.

*What Does It Mean To Train a Network*

This is a crucial question for the whole field of machine learning. I want to start by defining a few things. First of all, neural networks consume data to predict (classify or regress). The data that is available to us as practitioners is training data; we are going to use it to train our model. However, **the goal of any machine learning algorithm is to be trained subject to the constraint that the model should generalize well to unseen samples**. Here I will only consider the case of classification, as it is the most common task for a machine learning algorithm.

**There is a data-generating process that generates data.** There is a hypothetical data-generating distribution according to which this data is distributed. Some of this data can be collected, and it is, then, distributed according to an **empirical distribution**. If we pass this data through our model, we obtain the **model distribution**. These are the three main distributions we need to know about to train a neural network, and the concept of generalization is directly derived from these distributions.
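To make the first two distributions concrete, here is a minimal sketch. It assumes a hypothetical three-class data-generating distribution (the class names and probabilities in `P_DATA` are made up for illustration; in real problems this distribution is unknown), draws samples from it, and builds the empirical distribution from the observed frequencies. The helper names are my own and not part of any library.

```python
import random
from collections import Counter

# Hypothetical data-generating distribution over three classes (an assumption
# for illustration; in practice this distribution is unknown to us).
P_DATA = {"cat": 0.5, "dog": 0.3, "bird": 0.2}


def sample_dataset(n, seed=0):
    """Draw n samples from the data-generating distribution."""
    rng = random.Random(seed)
    classes = list(P_DATA)
    weights = [P_DATA[c] for c in classes]
    return rng.choices(classes, weights=weights, k=n)


def empirical_distribution(samples):
    """Relative frequency of each class in the collected samples."""
    counts = Counter(samples)
    return {c: counts[c] / len(samples) for c in P_DATA}


def total_variation(p, q):
    """Half the L1 distance between two distributions on the same classes."""
    return 0.5 * sum(abs(p[c] - q[c]) for c in p)


if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        p_hat = empirical_distribution(sample_dataset(n))
        print(n, p_hat, round(total_variation(P_DATA, p_hat), 4))
```

With only a handful of samples the empirical distribution can sit noticeably far from the data-generating one; with many samples the distance typically becomes small.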

**Our ultimate goal is to learn the data-generating distribution.** If we can do this, we can predict the classes with astonishing accuracy, even though there is still a **noise floor** (or Bayes error). The noise floor is noise that is embedded in the data-generating process itself (typos, noise in measuring equipment, etc.). Hence we can’t avoid it: even a model that matched the data-generating distribution exactly would still make errors at this rate.

We assume that the samples we will have to generalize to **come from the same data-generating process and follow the same data-generating distribution as the samples we have empirically observed**. It is crucial to mention that **overfitting** comes into play when the empirical distribution doesn’t represent the data-generating distribution, typically because not enough samples are available. In that case our neural network will model the empirical distribution closely but will fail to generalize well, since the empirical distribution is not entirely representative of the data-generating process. As you may already know, one of the most direct ways to improve your model’s performance is to collect more data, or otherwise make your empirical distribution more representative. Collecting more representative data generally improves the generalization power of your model, provided that the data contains real patterns and the noise floor is low. However, collecting data may not always be feasible or may be extremely expensive; therefore researchers came up with other ways of fighting overfitting: **regularization**. I will write about regularization a little later.

Now, what is the model distribution and what do we do with it? Each setting of the parameters, or weights, that our neural network uses to approximate a function yields one particular distribution; the set of all possible settings defines the family of distributions that our model can represent.

Remember that we don’t have access to the data-generating distribution; hence **we have to approximate it via the empirical distribution**. We are going to do so by making our model distribution as close as possible to the empirical distribution and then using our model distribution for inference. Luckily, we have statistical tools to perform this task.

*Kullback-Leibler Divergence And Cross-Entropy*

When I started building neural networks I heard all of these terms, such as KL divergence, Binary Cross-Entropy Loss, Cross-Entropy Loss and so on. But I never looked beyond the documentation of a deep learning library and took them for granted. Eventually I became interested in understanding how neural networks work (beyond the back-propagation and parameter update cycle), and here is, briefly, what I learned.

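**Kullback-Leibler** divergence measures the dissimilarity of two distributions. Cross-Entropy does the same but in a different way. So here is a little math. The Kullback-Leibler divergence between the empirical distribution and the model distribution is:

$$D_{KL}\left(\hat{p}_{\text{data}} \,\|\, p_{\text{model}}\right) = \mathbb{E}_{x \sim \hat{p}_{\text{data}}}\left[\log \hat{p}_{\text{data}}(x) - \log p_{\text{model}}(x; \theta)\right]$$

Where $\hat{p}_{\text{data}}$ is the empirical distribution, $p_{\text{model}}(x; \theta)$ is the model distribution, and $\theta$ denotes the weights of the network.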

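Notice that only the model distribution is parametrized by the weights, so it is the only one we are able to alter in any way. Now, let’s expand the right-hand side of the KL divergence:

$$D_{KL}\left(\hat{p}_{\text{data}} \,\|\, p_{\text{model}}\right) = \mathbb{E}_{x \sim \hat{p}_{\text{data}}}\left[\log \hat{p}_{\text{data}}(x)\right] - \mathbb{E}_{x \sim \hat{p}_{\text{data}}}\left[\log p_{\text{model}}(x; \theta)\right]$$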

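Notice that only the second term on the right-hand side, $-\mathbb{E}_{x \sim \hat{p}_{\text{data}}}\left[\log p_{\text{model}}(x; \theta)\right]$, is parametrized by the weights, so it is the only term we can alter. So we can minimize the Kullback-Leibler divergence (the measure of dissimilarity) between the empirical and model distributions by minimizing the Negative Log-Likelihood of the model:

$$\theta^{*} = \arg\min_{\theta} \; -\mathbb{E}_{x \sim \hat{p}_{\text{data}}}\left[\log p_{\text{model}}(x; \theta)\right]$$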
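The split of the KL divergence into a model-independent term and a Negative Log-Likelihood term can be checked numerically. The sketch below assumes a small discrete empirical distribution and three candidate model distributions; all numbers are made up for illustration, and the function names are hypothetical helpers rather than library calls.

```python
import math

# Hypothetical empirical distribution over three classes (assumed values).
P_HAT = [0.5, 0.3, 0.2]


def expected_log(p, q):
    """E_{x ~ p}[log q(x)] for discrete distributions given as lists."""
    return sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_terms(p_hat, p_model):
    """Return (E[log p_hat], NLL, KL), where KL = E[log p_hat] + NLL."""
    first_term = expected_log(p_hat, p_hat)   # does not depend on the model
    nll = -expected_log(p_hat, p_model)       # the only model-dependent term
    return first_term, nll, first_term + nll


if __name__ == "__main__":
    candidates = [("poor model", [0.1, 0.1, 0.8]),
                  ("better model", [0.4, 0.4, 0.2]),
                  ("perfect model", P_HAT)]
    for name, q in candidates:
        first, nll, kl = kl_terms(P_HAT, q)
        print(f"{name}: first term={first:.4f} NLL={nll:.4f} KL={kl:.4f}")
```

The first term is identical for every candidate, so ranking the models by KL divergence and ranking them by Negative Log-Likelihood give the same order.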

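**Cross-Entropy** between two distributions is tied into this process by having the Kullback-Leibler divergence as a term in its definition:

$$H\left(\hat{p}_{\text{data}}, p_{\text{model}}\right) = H\left(\hat{p}_{\text{data}}\right) + D_{KL}\left(\hat{p}_{\text{data}} \,\|\, p_{\text{model}}\right)$$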

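Here, **H()** is the Entropy of its argument, which is the empirical distribution, and is defined as:

$$H\left(\hat{p}_{\text{data}}\right) = -\mathbb{E}_{x \sim \hat{p}_{\text{data}}}\left[\log \hat{p}_{\text{data}}(x)\right]$$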

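Now, notice what happens if we substitute the entropy of the empirical distribution into the definition of the Cross-Entropy:

$$H\left(\hat{p}_{\text{data}}, p_{\text{model}}\right) = -\mathbb{E}_{x \sim \hat{p}_{\text{data}}}\left[\log \hat{p}_{\text{data}}(x)\right] + \mathbb{E}_{x \sim \hat{p}_{\text{data}}}\left[\log \hat{p}_{\text{data}}(x) - \log p_{\text{model}}(x; \theta)\right]$$

This is pretty neat: the two $\log \hat{p}_{\text{data}}(x)$ terms cancel. Now, we can write the Cross-Entropy in the form of an expectation:

$$H\left(\hat{p}_{\text{data}}, p_{\text{model}}\right) = -\mathbb{E}_{x \sim \hat{p}_{\text{data}}}\left[\log p_{\text{model}}(x; \theta)\right]$$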

From this and the previous statements we can see that if we want to minimize the Cross-Entropy (the cross-entropy loss in many deep learning libraries), we need to minimize the Negative Log-Likelihood of the model. In fact, the cross-entropy loss in many libraries is computed as Log-Softmax followed by the Negative Log-Likelihood loss under the hood, as in *PyTorch*.
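The "Log-Softmax followed by Negative Log-Likelihood" composition can be sketched in plain Python. This is not PyTorch’s implementation; the function names only echo the idea, the logits and targets are made-up values, and averaging over the batch is an assumption of this sketch.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of a list of raw scores."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]


def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class index."""
    return -log_probs[target]


def cross_entropy(batch_logits, targets):
    """Mean over the batch of nll_loss(log_softmax(logits), target)."""
    losses = [nll_loss(log_softmax(z), t)
              for z, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]  # made-up network outputs
    targets = [0, 2]                               # made-up class labels
    print(cross_entropy(logits, targets))
```

Subtracting the maximum logit before exponentiating is a common trick to avoid overflow; it does not change the result.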

But the big question still remains unanswered: how does the neural network learn by minimizing the Kullback-Leibler divergence or Cross-Entropy?

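It turns out that there is a link between the Maximum Likelihood Estimate of the weights and the two aforementioned losses. For $m$ training samples $x^{(1)}, \dots, x^{(m)}$, the MLE for the weights is defined as:

$$\theta_{ML} = \arg\max_{\theta} \prod_{i=1}^{m} p_{\text{model}}\left(x^{(i)}; \theta\right) = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{\text{model}}\left(x^{(i)}; \theta\right) = \arg\max_{\theta} \; \mathbb{E}_{x \sim \hat{p}_{\text{data}}}\left[\log p_{\text{model}}(x; \theta)\right]$$

Taking the logarithm and dividing by $m$ do not change the location of the maximum, and maximizing this expectation is the same as minimizing its negative, the Negative Log-Likelihood.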

Therefore, we can see that **the minimization of the Kullback-Leibler divergence and Cross-Entropy between the empirical and the model distributions is equivalent to finding the Maximum-Likelihood estimate for the parameters of the neural network**, i.e. weights.
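A toy check of this equivalence, under stated assumptions: the "model" below is a single-parameter Bernoulli distribution rather than a neural network, the ten binary labels are made up, and a coarse grid search stands in for gradient-based training. The parameter that maximizes the likelihood and the one that minimizes the average Negative Log-Likelihood coincide.

```python
import math

# Made-up binary training labels (an assumption): 7 ones and 3 zeros.
DATA = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]


def likelihood(theta, data):
    """Product of Bernoulli probabilities p_model(x; theta) over the samples."""
    return math.prod(theta if x == 1 else 1.0 - theta for x in data)


def mean_nll(theta, data):
    """Average negative log-likelihood of the samples, i.e. the cross-entropy
    between their empirical distribution and the model distribution."""
    total = sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)
    return -total / len(data)


# Candidate parameter values 0.01, 0.02, ..., 0.99 (a coarse grid search).
GRID = [i / 100 for i in range(1, 100)]

theta_max_likelihood = max(GRID, key=lambda t: likelihood(t, DATA))
theta_min_nll = min(GRID, key=lambda t: mean_nll(t, DATA))

if __name__ == "__main__":
    print(theta_max_likelihood, theta_min_nll, sum(DATA) / len(DATA))
```

Both searches land on the fraction of ones in the data, which is the closed-form maximum likelihood estimate for a Bernoulli parameter.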

Through the minimization of the Kullback-Leibler divergence or Cross-Entropy between the empirical distribution and the model distribution, or equivalently of the Negative Log-Likelihood loss, our model can find the set of parameters that results in the best fit to the evidence, i.e. the training data (the empirical distribution).

*If the empirical distribution is highly representative of the data-generating distribution, a model that fits the training data will be able to generalize very well to unseen examples, under the assumption that the unseen examples come from the same data-generating distribution.*

This is the first part of the theory behind neural network training. In the next part I will try to elaborate on the hypothesis space and the problem of overfitting.