Over the last 5–7 years it has turned out that neural networks outperform other algorithms on many tasks involving the analysis of natural information, and almost everything around us is natural information: language, speech, images, video, and much more. At least for now. Perhaps this renaissance will end again and something else will replace them, but today neural networks show the best results in most cases.

Neural networks, like any machine learning algorithm, need to be trained. But unlike most algorithms, neural networks are very sensitive to the amount of data, that is, to the size of the training sample required to train them. On small amounts of data, networks simply do not work well: they generalize poorly and perform badly on examples they did not see during training. Over the past 15 years, however, the amount of data in the world has grown roughly exponentially, so this is no longer such a big problem. We have a lot of data.

The second cornerstone of the network renaissance is computing resources. Neural networks are among the heaviest machine learning algorithms: enormous computing resources are needed to train a neural network, and even to apply it. And now we have such resources. And, of course, new algorithms have been invented. Science and engineering keep moving forward, and we now understand much more about how to train such structures.

Okay then, now we will make this post a bit more advanced and take a look at more complex things. Let's start with the smallest detail: the architecture of a neuron.

What is a neuron?

A neuron is a very simple element: it has some limited number of inputs, a weight is attached to each of these inputs, and the neuron simply computes a weighted sum of its inputs. The inputs may be, for example, the same image pixels that I talked about before.

Imagine that x1 through xn are just all the pixels of the image, and each pixel has some weight attached to it. The neuron sums them up and applies some nonlinear transformation to the result. But even if we do not touch the nonlinear transformation, one such neuron is already a fairly powerful classifier: you can view this neuron as a linear classifier, and that is exactly what a formal neuron is.

If, for example, we have a set of points of two classes in a two-dimensional space, with features x1 and x2, then by choosing the weights w1 and w2 we can build a separating surface in this space. If the weighted sum is greater than zero, the object belongs to the first class; if the sum is less than zero, the object belongs to the second class.
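
As a sketch, a formal neuron used as a linear classifier might look like this (the weights and points below are made up for illustration):

```python
def classify(x, w):
    """Return 1 if the weighted sum is positive, else 2 (the second class)."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s > 0 else 2

# A separating line x1 - x2 = 0 corresponds to weights w1 = 1, w2 = -1.
w = [1.0, -1.0]
print(classify([3.0, 1.0], w))  # point on one side of the line -> class 1
print(classify([1.0, 3.0], w))  # point on the other side -> class 2
```

The sign of the weighted sum is the whole decision rule: moving a point across the line flips the class.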

And everything would be fine, except that this picture is very optimistic: there are only two features, and the classes are, as they say, linearly separable. This means we can simply draw a line that correctly classifies all objects of the training set. In practice this does not always happen, and in fact almost never happens, so one neuron is not enough to solve the vast majority of practical problems.

The nonlinear transformation that each neuron applies to this sum is critically important. Suppose we perform only the simple summation and call the result a new feature y1 (W1x1 + W2x2 = y1), and then a second neuron sums the same features with its own weights: W1’x1 + W2’x2 = y2.

If we then apply linear classification again in the space of these new features, it makes no sense, because two consecutively applied linear classifiers can easily be replaced by a single one; this is simply a property of the linearity of the operations. But if we apply some nonlinear transformation to these features, the picture changes. Previously, more complex nonlinear transformations were used, such as the logistic function: it is bounded between zero and one, and it has sections that are close to linear.

That is, around x = 0 it behaves almost linearly, like an ordinary line, and further out it behaves nonlinearly. But, as it turned out, to effectively train classifiers of this kind, the simplest nonlinearity in the world is enough: the rectified line, which is a straight line on the positive side and always 0 on the negative side. This is the simplest nonlinearity there is, and it turns out that even it is already enough to train classifiers effectively.
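
Both nonlinearities mentioned here can be written in a couple of lines (a minimal sketch):

```python
import math

def sigmoid(z):
    # logistic function: bounded between 0 and 1, roughly linear near z = 0
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    # the "rectified line": identity for positive z, zero on the negative side
    return max(0.0, z)

print(sigmoid(0.0))            # 0.5, the middle of the (0, 1) range
print(relu(-2.0), relu(3.0))   # 0.0 3.0
```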

What Is a Neural Network?

A neural network is a sequence of such transformations. F1 is a so-called neural network layer: simply a collection of neurons that operate on the same features. Imagine that we have the initial features x1, x2, x3, and three neurons, each of which is connected to all of these features.

But each neuron has its own weights with which it weighs these features, and the task of training the network is to select, for each neuron, the weights that optimize our error function. The function F1 is one layer of such neurons, and after applying it we get some new feature space. Then we apply one more layer to this new space of features.

That layer may have a different number of neurons and some other nonlinearity as its transforming function, but it consists of the same kind of neurons, just with their own weights. By applying these transformations sequentially, we get the overall function F, the transformation function of the neural network, which consists of the sequential application of several functions.
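
A sketch of such a sequence of layers, with made-up weights, might look like this:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(x, W):
    """Apply one layer: every neuron weighs the same inputs with its own weights."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W]

# Illustrative weights: layer F1 has three neurons over features x1..x3,
# layer F2 has a single output neuron over F1's outputs.
W1 = [[0.2, -0.5, 0.1],
      [0.7, 0.3, -0.2],
      [-0.4, 0.6, 0.5]]
W2 = [[1.0, -1.0, 0.5]]

x = [1.0, 2.0, 3.0]
h = layer(x, W1)   # the new feature space produced by F1
y = layer(h, W2)   # F(x) = F2(F1(x))
print(y)
```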

How Do We ‘Train’ Neural Networks?
In principle, like any other learning algorithm. We have some output vector produced by the network, for example a class label. And there is some reference output that we know these features should produce, for example the true class of the object, or the number we should assign to it.

And we have some delta, that is, the difference between the output vector and the reference vector. Based on this delta there is a big formula, but its essence is this: the delta depends on Fn, the output of the last layer of the network, so we take the derivative of this delta with respect to the weights, that is, with respect to the elements we want to train, and apply the so-called chain rule, the rule for differentiating a composite function.

So if we have no error on some particular training example, the derivatives will be zero, which means we classify it correctly and do not need to do anything. If the error on a training example is very large, then we must do something about it, somehow changing the weights to reduce the error.
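
This reasoning can be checked numerically for a single sigmoid neuron with a squared-error delta (all weights and values below are made up):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, y):
    # squared difference between the neuron's output and the reference
    a = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
    return 0.5 * (a - y) ** 2

w, x, y = [0.5, -0.3], [1.0, 2.0], 1.0
a = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

# chain rule: dL/dw_j = (a - y) * a * (1 - a) * x_j
analytic = [(a - y) * a * (1 - a) * xj for xj in x]

# numerical check: nudge each weight by a small eps and see how the loss moves
eps = 1e-6
base = loss(w, x, y)
numeric = []
for j in range(len(w)):
    w_eps = list(w)
    w_eps[j] += eps
    numeric.append((loss(w_eps, x, y) - base) / eps)

print(analytic, numeric)  # the two gradients nearly coincide
```

Note that when the output matches the reference (a = y), every component of the gradient is zero, which is exactly the "no error, do nothing" case above.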

Gradient Descent and Backpropagation
When we extract a square root with a calculator, we usually have little interest in how it does it. We know perfectly well what we want to get, and we entrust the computation to the machine. But for those who are interested in this method, we will consider it briefly.

To begin with, each output neuron in our model is responsible for only one class. If the input is an object of the class the neuron is responsible for, we want to see a one at its output, otherwise a zero. In a real class prediction, as we already know, an artificial neuron's activation lies in the open interval between zero and one, and the value can be arbitrarily close to these two asymptotes.

This means that the more accurately we guess the class, the smaller the absolute difference between the real class label and the activation of the neuron responsible for that class.

Let’s try to create a loss function that returns a numerical penalty value.

Graphs of the penalty as a function of the neuron's output:

If the object belongs to the class, the penalty is -log(a): it approaches zero as the output approaches one and grows without bound as the output approaches zero.
If the object does not belong to the class, the penalty is -log(1 - a): it approaches zero as the output approaches zero and grows without bound as the output approaches one.
Now it remains to write the loss function as an expression. Let me remind you that y for each i-th element of a training sample of size m always takes the value zero or one, so only one of the two terms ever remains in the expression:

L(W) = -(1/m) · Σᵢ [ y⁽ⁱ⁾ · log(h(x⁽ⁱ⁾)) + (1 - y⁽ⁱ⁾) · log(1 - h(x⁽ⁱ⁾)) ]
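
The expression can be sketched in code as follows (a minimal version; the variable names are my own):

```python
import math

def binary_crossentropy(A, Y):
    """Average penalty: -log(a) when y = 1, -log(1 - a) when y = 0."""
    m = len(Y)
    return -(1 / m) * sum(y * math.log(a) + (1 - y) * math.log(1 - a)
                          for a, y in zip(A, Y))

# Confident correct answers give a small loss, confident wrong ones a large one.
print(binary_crossentropy([0.9, 0.1], [1, 0]))  # small
print(binary_crossentropy([0.1, 0.9], [1, 0]))  # large
```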

Those who are familiar with information theory will recognize cross-entropy in this expression. From the point of view of information theory, learning amounts to minimizing the cross-entropy between the real classes and the model's hypotheses.

Having initialized the coefficients of the weight matrix randomly, we want to change them in a way that makes our model better, in other words, reduces the loss. If we know how much the weights influence the loss function, it becomes clear how they need to be changed.

The partial derivatives, which make up the gradient, will help us here. The gradient shows how a function depends on its arguments: by how much each argument needs to change for the function to change by some (infinitesimally small) amount. So we can update the weight matrix:

W := W - α · ∂L(W)/∂W

We repeat this step iteratively. In effect, this is a gradual descent in small steps of size α (this parameter is also called the learning rate) toward a local minimum of the loss function. In other words, at each point defined by the current values of W, we find the direction in which the loss changes fastest, and the learning dynamics resemble a ball gradually rolling into a local minimum.

Gradient Descent:

import math

# Helpers assumed by gradient_descent: a sigmoid neuron and the
# cross-entropy loss (minimal sketches of one possible implementation).
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w):
    # weighted sum of the features followed by the sigmoid activation
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def binary_crossentropy(X, Y, model):
    m = len(Y)
    return -(1 / m) * sum(y * math.log(neuron(x, model))
                          + (1 - y) * math.log(1 - neuron(x, model))
                          for x, y in zip(X, Y))

def gradient_descent(model, X_in, Y, number_of_iterations=500, learning_rate=0.1):
    # X_in is assumed to be a list of NumPy feature vectors;
    # prepend the constant bias feature x0 = 1 to every example
    X = [[1] + v.tolist() for v in X_in]
    m = len(Y)
    for it in range(number_of_iterations):
        new_model = []
        for j, w in enumerate(model):
            error = 0
            for i, x in enumerate(X):
                error += (1 / m) * (neuron(X[i], model) - Y[i]) * X[i][j]
            w_new = w - learning_rate * error
            new_model.append(w_new)
        model = new_model
        model_loss = binary_crossentropy(X, Y, model)  # track the loss per iteration
    return model

Backpropagation continues this chain of reasoning for the case of a multilayer neural network. Thanks to it, it becomes possible to train the deep layers with gradient descent. Training proceeds step by step from the last layer to the first. I think this information is quite enough to understand the essence of the method.
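
A minimal sketch of this idea for a network with one hidden layer, assuming sigmoid activations and made-up weights and data (all names below are illustrative):

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]  # hidden layer
W2 = [random.uniform(-1, 1) for _ in range(3)]                      # output neuron

x, y = [0.5, -1.0], 1.0   # one made-up training example
lr = 0.5

for step in range(200):
    # forward pass
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    a = sigmoid(sum(w * hi for w, hi in zip(W2, h)))

    # backward pass: the output delta first, then push it back through W2
    delta_out = a - y                                # cross-entropy delta at the output
    delta_hidden = [delta_out * w2 * hi * (1 - hi)   # chain rule through the sigmoid
                    for w2, hi in zip(W2, h)]

    # update the last layer, then the first one
    W2 = [w - lr * delta_out * hi for w, hi in zip(W2, h)]
    W1 = [[w - lr * d * xi for w, xi in zip(row, x)]
          for row, d in zip(W1, delta_hidden)]

print(a)  # the output approaches the target y = 1 as training progresses
```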

Neural network training example
Suppose we want to know the probability of a customer’s purchase based on just one parameter: their age. We want to create a neuron that is excited in cases where the probability of a purchase is more than 50%.

The neuron has one receptor, connected to the age of the customer. In addition, we add one bias term, which is responsible for the shift (the offset). Suppose the boundary between buyers and non-buyers lies at the age of 42: the neuron should give a purchase probability of less than 0.5 for customers younger than 42 and more than 0.5 for older customers.

If you recall the activation function, it returns values greater than 0.5 for positive arguments and less than 0.5 for negative ones. So we need the ability to shift this activation function to some threshold value.

At the same time, we want a steepness of the activation function's bend that best fits the training sample. The degree of excitation of each neuron as a function of the feature vector x depends on the coefficients of the weight matrix, and so, accordingly, does the probability that an element with such a feature vector belongs to this class.

Now let's write this mathematically and see why we need another bias term in the weight matrix. To shift the function f(x) to the right, for example by 42, we must subtract 42 from its argument: f(x - 42). We also want a gentle bend of the function, so we multiply the argument by, for example, 0.25, obtaining the function f(0.25(x - 42)).

Opening the brackets, we get f(0.25x - 10.5).

In our case, the desired coefficient of the weight matrix is w = 0.25, and the shift is b = -10.5. We can treat b as the zero coefficient of the weight matrix (w0 = b) if, for every example, the zero feature is always one (x0 = 1). Then, for example, the fifteenth “vectorized” customer, aged 45 and represented as x(15) = {x(15)0, x(15)1} = [1, 45], would buy with a probability of about 68%.
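
Under these assumptions the computation looks like this (a small sketch; `purchase_probability` is my own name):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Weights from the example: w0 = b = -10.5 (bias), w1 = 0.25 (age weight).
w = [-10.5, 0.25]

def purchase_probability(age):
    # the zero feature is always 1, so the bias is just another weight
    x = [1, age]
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

print(purchase_probability(42))  # 0.5 exactly at the threshold age
print(purchase_probability(45))  # about 0.68 for the 45-year-old customer
```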

All these coefficients, even in such a simple example, are hard to figure out “by eye”. Therefore, in practice, we entrust the search for these parameters to machine learning algorithms.

For example, we are looking for two coefficients of the weight matrix (w0 = b and w1).

Having initialized the coefficients of the weight matrix randomly and used an evolutionary algorithm, we obtained a trained neural network after a hundred generations. To reach the desired result faster and without such sharp jumps, the data should be normalized before training.

The gradient descent method works most accurately of all. When using it, the data should always be normalized.

Model quality assessment
Having prepared the model, it is necessary to adequately assess its quality. To do this, we introduce the following concepts:

TP (True Positive) — true positive. The classifier decided that the customer would buy, and he bought.
FP (False Positive) — a false positive. The classifier decided that the customer would buy, but they did not. This is the so-called error of the first kind. It is not as dangerous as an error of the second kind, especially in cases where the classifier is a test for some disease.
FN (False Negative) — a false negative. The classifier decided that the customer would not buy, but they would have bought (or already did). This is the so-called error of the second kind. Usually, when building a model, it is desirable to minimize the error of the second kind, even at the cost of increasing the error of the first kind.
TN (True Negative) — true negative. The classifier decided that the client would not buy, and he did not buy it.
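
These four counts can be computed with a small sketch like the following (the prediction lists are made up):

```python
def confusion_counts(predicted, actual):
    """Count TP, FP, FN, TN over paired predicted/actual purchase labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
    return tp, fp, fn, tn

predicted = [1, 1, 0, 0, 1]
actual    = [1, 0, 1, 0, 1]
tp, fp, fn, tn = confusion_counts(predicted, actual)
print(tp, fp, fn, tn)  # 2 1 1 1
```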
Wrapping it up…
My congratulations: if you have come this far, you are a total hero! But don’t rush things; take your time with learning, because there are lots of complicated things ahead.

Here are some good sources for becoming a professional in deep learning:

As always, if you do anything cool with this information, leave a response in the comments below or reach out at any time on my Instagram and Medium blog.

Thanks for reading!