How Neural Network “Learn”



Source: https://stats385.github.io/assets/img/grad_descent.png

In my first story, I explained how a neural network processes your input. Before a neural network can make predictions as in the previous post, it must go through a preparatory phase. This phase determines the weight and bias values that the network uses when processing your input.

There are two phases in the life cycle of a neural network, and of machine learning algorithms in general: the training phase and the prediction phase. The weight and bias values are found during the training phase. The prediction phase is where the network processes our input to produce predictions, as in the previous post. This time, I will discuss how neural networks find good weight and bias values, a.k.a. “learn”, during the training phase so that they can make accurate predictions (read: regression or classification).

So, how do neural networks get optimal weight and bias values? The answer is through an error gradient. When correcting the current weights and biases (which are initially generated randomly), we want to know two things. First, is each current value too large or too small relative to its optimal value (do we need to decrease or increase it)? Second, how far does it deviate from its optimal value (by how much do we need to decrease or increase it)? The gradient we are looking for is the derivative of the error with respect to the weights and biases.

$$\frac{\partial E}{\partial W}, \qquad \frac{\partial E}{\partial b}$$

where E is the error, W is a weight, and b is a bias.

Why is that? Because we want to know how our current weights and biases affect the network’s error, which is what we need to answer the two questions in the paragraph above (decrease or increase, and by how much). We obtain the gradient values through a well-known algorithm called backpropagation. We then use those gradients to improve the weights and biases through an optimization algorithm. One example is gradient descent, the simplest and one of the most frequently used optimization algorithms. It simply subtracts from the current weights and biases the gradient multiplied by a constant called the learning rate. The learning rate and further details are discussed later in this post.
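To make the two questions concrete, here is a minimal sketch on a toy one-parameter problem. The error function, its optimum at 3, the starting values, and the learning rate of 0.1 are all assumptions chosen for illustration; they are not taken from the network in this post.

```python
# Toy error function with a known optimum at w = 3 (hypothetical example).
def error(w):
    return (w - 3.0) ** 2

def gradient(w):
    # Derivative of (w - 3)^2 with respect to w.
    return 2.0 * (w - 3.0)

learning_rate = 0.1  # assumed value for this toy example

w_too_large = 5.0
w_too_small = 1.0

# A positive gradient means w is above the optimum, so the update decreases it.
# A negative gradient means w is below the optimum, so the update increases it.
new_from_large = w_too_large - learning_rate * gradient(w_too_large)
new_from_small = w_too_small - learning_rate * gradient(w_too_small)

print(gradient(w_too_large), new_from_large)
print(gradient(w_too_small), new_from_small)
```

The sign of the gradient answers "decrease or increase?", and its size, scaled by the learning rate, sets how far each step moves.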

Suppose we have a neural network as below.

Our neural network has a 3–2–2 structure

Suppose we have an input vector, bias vector, weight matrix, and truth value as below

To make it unambiguous, the order of the weight values is

Let’s do the forward pass. The process is the same as in the previous post. The activation function that we use for the hidden-layer neurons in this demonstration is the sigmoid function.

Here we round the output of the sigmoid function to 4 decimal places. In real calculations you should not round like this, because rounding errors carry through every later step and reduce the accuracy of the network. We round here only to simplify the calculations and keep the write-up short.

Before we proceed to the next layer, please note that the next layer is the last layer, which means it is the output layer. In this layer, we apply only a pure linear operation, with no activation function.
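The forward pass through the 3–2–2 network can be sketched in plain Python as below. The numbers are hypothetical stand-ins, not the values from the original figures, and the index order (`W1[j][i]` connects input i to hidden neuron j, `W2[k][j]` connects hidden neuron j to output neuron k) is an assumption made for this sketch.

```python
import math

def sigmoid(p):
    return 1.0 / (1.0 + math.exp(-p))

# Hypothetical values for a 3-2-2 network (not the article's numbers).
x = [1.0, 0.5, -1.0]          # input vector (3 inputs)
W1 = [[0.1, 0.2, 0.3],        # W1[j][i]: input i -> hidden neuron j
      [0.4, 0.5, 0.6]]
b1 = [0.1, 0.2]               # hidden-layer biases
W2 = [[0.7, 0.8],             # W2[k][j]: hidden neuron j -> output neuron k
      [0.9, 1.0]]
b2 = [0.3, 0.4]               # output-layer biases

# Hidden layer: pure linear operation followed by the sigmoid.
P = [sum(W1[j][i] * x[i] for i in range(3)) + b1[j] for j in range(2)]
h = [sigmoid(p) for p in P]

# Output layer: pure linear operation only, no activation.
O = [sum(W2[k][j] * h[j] for j in range(2)) + b2[k] for k in range(2)]

print(h)
print(O)
```

This sketch keeps full floating-point precision instead of rounding to 4 decimals.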

It’s time to calculate the error. In this case we use the Mean Squared Error (MSE) to calculate the errors in the output neurons. The MSE equation is as follows

In our case, N = 1 because we have just one data point, so the equation reduces to

Let’s calculate the error of each neuron in the output layer based on the truth value (T) that we defined earlier.
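As a small sketch of this step, the code below assumes the standard form of the MSE, the average of (T − O)² over N samples, which for N = 1 is a single squared difference per output neuron. The original equation may use a different scaling, and the truth and output values here are hypothetical.

```python
def squared_error(target, output):
    # Assumed N = 1 form of the MSE: E = (T - O)^2
    return (target - output) ** 2

T = [1.0, 0.0]   # hypothetical truth values
O = [0.8, 0.3]   # hypothetical outputs of the two output neurons

# One error value per output neuron.
E = [squared_error(t, o) for t, o in zip(T, O)]
print(E)
```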

So that’s our current error in the output layer. Now it is time to minimize the error: we find the gradient of the error with respect to the weights and biases between each pair of layers via backpropagation (a.k.a. the backward pass), and then apply gradient descent. Backpropagation is essentially just the chain rule; how it works is discussed shortly. For now, let’s find the derivatives of all the equations we used in the forward pass.

  1. Derivative of E with respect to O

  2. Derivative of the sigmoid function (h) with respect to P (the output of the pure linear operation)

$$\frac{\partial h}{\partial P} = h\,(1 - h)$$

where h is

$$h = \frac{1}{1 + e^{-P}}$$

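  3. Derivative of the pure linear operation with respect to the weight (W), bias (b), and input (h)

$$\frac{\partial P}{\partial W_l} = h_l, \qquad \frac{\partial P}{\partial b} = 1, \qquad \frac{\partial P}{\partial h_l} = W_l$$

where purelin is

$$P = \sum_{l=1}^{M} W_l\, h_l + b$$

where l is a number from 1 to M.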
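The three derivatives above can be written as small Python functions and sanity-checked against a finite-difference approximation. This is only a sketch: it assumes the N = 1 error has the form E = (T − O)², and the helper names (`d_error_d_output`, `d_sigmoid_d_p`, `purelin`, `numeric_derivative`) are made up for this illustration.

```python
import math

def sigmoid(p):
    return 1.0 / (1.0 + math.exp(-p))

def squared_error(target, output):
    # Assumed N = 1 form of the MSE: E = (T - O)^2
    return (target - output) ** 2

def d_error_d_output(target, output):
    # dE/dO for E = (T - O)^2
    return -2.0 * (target - output)

def d_sigmoid_d_p(p):
    # dh/dP = h * (1 - h)
    h = sigmoid(p)
    return h * (1.0 - h)

def purelin(weights, inputs, bias):
    # P = sum_l W_l * h_l + b
    return sum(w * v for w, v in zip(weights, inputs)) + bias

# For purelin: dP/dW_l = h_l, dP/db = 1, dP/dh_l = W_l.

def numeric_derivative(f, v, eps=1e-6):
    # Central finite difference, used only as a sanity check.
    return (f(v + eps) - f(v - eps)) / (2.0 * eps)

print(d_sigmoid_d_p(0.3), numeric_derivative(sigmoid, 0.3))
print(d_error_d_output(1.0, 0.8),
      numeric_derivative(lambda o: squared_error(1.0, o), 0.8))
```

Each printed pair should agree to several decimal places, a cheap way to catch a wrong sign or a missing factor before wiring the derivatives into backpropagation.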

That’s all we need; it’s time to apply backpropagation. We first look for the gradients of the weights and biases between the hidden layer and the output layer. To find these gradients, we use the chain rule.

Applying these, we get
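As a rough sketch of this step in code: for the output layer, the chain rule multiplies the derivative of the error with respect to an output by the derivative of the pure linear operation with respect to the weight or bias. The values below are hypothetical, the error is assumed to be E = (T − O)² per output neuron, and `W2[k][j]` is assumed to connect hidden neuron j to output neuron k.

```python
# Hypothetical values (not the article's numbers).
T = [1.0, 0.0]   # truth values
O = [0.8, 0.3]   # outputs from the forward pass
h = [0.6, 0.4]   # hidden-layer activations from the forward pass

# dE/dO for each output neuron, assuming E = (T - O)^2.
dE_dO = [-2.0 * (T[k] - O[k]) for k in range(2)]

# Chain rule: dE/dW2[k][j] = dE/dO_k * dO_k/dW2[k][j] = dE/dO_k * h_j
dE_dW2 = [[dE_dO[k] * h[j] for j in range(2)] for k in range(2)]

# Chain rule: dE/db2[k] = dE/dO_k * dO_k/db2[k] = dE/dO_k * 1
dE_db2 = dE_dO[:]

print(dE_dW2)
print(dE_db2)
```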

So that’s our gradient for the layer between the hidden layer and the output layer. Now, on to the next layer. Here comes the real challenge (not so challenging, really)! But don’t worry, after this everything will be clear and easy :).

The chain rule in backpropagation is all about the paths between neurons. Let’s collect the information!

  1. There are 2 neurons in the hidden layer, and each is connected to 3 weights and 1 bias on its left side (between the input layer and the hidden layer).
  2. On the right side, every neuron in the hidden layer is connected to the 2 neurons in the output layer.

This information is very important for finding the gradient of W1. From it, the gradients we want to find are

where

All possible paths from the weight in question to the output layer are added together. That is why there is a sum of 2 terms in the equation above. Now, let’s compute the actual gradient of W1.
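One consistent way to write the sum over paths is sketched below. The indices are assumptions made for this sketch: i runs over the 3 inputs, j over the 2 hidden neurons, and E is taken to be the sum of the two output-neuron errors E_1 and E_2.

```latex
\frac{\partial E}{\partial W1_{ji}}
  = \frac{\partial E}{\partial h_j}\,
    \frac{\partial h_j}{\partial P_j}\,
    \frac{\partial P_j}{\partial W1_{ji}},
\qquad
\frac{\partial E}{\partial h_j}
  = \frac{\partial E_1}{\partial O_1}\,\frac{\partial O_1}{\partial h_j}
  + \frac{\partial E_2}{\partial O_2}\,\frac{\partial O_2}{\partial h_j}
```

Each of the two terms corresponds to one path from hidden neuron j to an output neuron.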

Substituting the partial derivative of E with respect to h, we get

Now for the biases, a.k.a. b1
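The hidden-layer gradients can be sketched in code and checked against a finite-difference estimate. As before, the numbers are hypothetical, the total error is assumed to be the sum of (T − O)² over the two output neurons, and the index order of `W1` and `W2` is an assumption of this sketch.

```python
import math

def sigmoid(p):
    return 1.0 / (1.0 + math.exp(-p))

# Hypothetical values for a 3-2-2 network (not the article's numbers).
x = [1.0, 0.5, -1.0]
T = [1.0, 0.0]
W1 = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]   # W1[j][i]: input i -> hidden j
b1 = [0.1, 0.2]
W2 = [[0.7, 0.8], [0.9, 1.0]]             # W2[k][j]: hidden j -> output k
b2 = [0.3, 0.4]

def forward(W1, b1):
    P = [sum(w * xi for w, xi in zip(W1[j], x)) + b1[j] for j in range(2)]
    h = [sigmoid(p) for p in P]
    O = [sum(W2[k][j] * h[j] for j in range(2)) + b2[k] for k in range(2)]
    return h, O

def total_error(W1, b1):
    # Assumed total error: sum of the two output-neuron errors.
    _, O = forward(W1, b1)
    return sum((T[k] - O[k]) ** 2 for k in range(2))

h, O = forward(W1, b1)
dE_dO = [-2.0 * (T[k] - O[k]) for k in range(2)]

# Sum over both paths from hidden neuron j to the output layer.
dE_dh = [sum(dE_dO[k] * W2[k][j] for k in range(2)) for j in range(2)]

# Through the sigmoid: dh/dP = h * (1 - h).
dE_dP = [dE_dh[j] * h[j] * (1.0 - h[j]) for j in range(2)]

# Through the pure linear operation: dP/dW1[j][i] = x_i and dP/db1[j] = 1.
dE_dW1 = [[dE_dP[j] * x[i] for i in range(3)] for j in range(2)]
dE_db1 = dE_dP[:]

print(dE_dW1)
print(dE_db1)
```

Comparing one entry of `dE_dW1` with `(total_error(W1 + eps) - total_error(W1 - eps)) / (2 * eps)` for a small `eps` is a quick way to confirm that the path sum was assembled correctly.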

And that’s the end of the backpropagation algorithm’s role. Now, on to the optimization algorithm. An optimization algorithm determines how we use the gradients we have obtained to correct the existing weights and biases. The optimization algorithm we choose is gradient descent, which corrects the weights and biases using the equation below.

$$W' = W - a \cdot \text{gradient}$$

where W’ is the new weight, W is the current weight, a is the learning constant (the learning rate), and gradient is the gradient we obtained from backpropagation. The same update applies to the biases. The learning constant is crucial: if it is too big, training will not converge, and if it is too small, more iterations are needed, so the training phase takes longer. Suppose we have a learning constant equal to 0.02.
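A single gradient descent update with the learning constant 0.02 might look like the sketch below. The weight and gradient values are hypothetical placeholders, not the numbers computed in this post.

```python
learning_rate = 0.02  # the learning constant used in the text

# Hypothetical current weights and their gradients (illustrative only).
W = [[0.7, 0.8],
     [0.9, 1.0]]
grad = [[-0.24, -0.16],
        [0.36, 0.24]]

# W' = W - a * gradient, applied element by element.
W_new = [[w - learning_rate * g for w, g in zip(w_row, g_row)]
         for w_row, g_row in zip(W, grad)]

print(W_new)
```

Weights with a negative gradient are nudged up and weights with a positive gradient are nudged down, each by 2% of the gradient’s size.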

And so on. This process is repeated (with the same input fed in again each time) until the required number of iterations or the target error has been reached.
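Putting the pieces together, the loop below is a minimal sketch of training on one sample: forward pass, backward pass, gradient descent update, repeated until the error target or an iteration limit is reached. The starting values, the iteration limit of 5000, the target error of 1e-4, and the function name `train` are assumptions of this sketch, and the error is again assumed to be the sum of (T − O)² over the output neurons.

```python
import math

def sigmoid(p):
    return 1.0 / (1.0 + math.exp(-p))

def train(x, T, W1, b1, W2, b2, learning_rate=0.02,
          max_iterations=5000, target_error=1e-4):
    """Train on one sample; updates the weights in place, returns the error history."""
    n_in, n_hid, n_out = len(x), len(b1), len(b2)
    history = []
    for _ in range(max_iterations):
        # Forward pass: sigmoid hidden layer, pure linear output layer.
        P = [sum(W1[j][i] * x[i] for i in range(n_in)) + b1[j]
             for j in range(n_hid)]
        h = [sigmoid(p) for p in P]
        O = [sum(W2[k][j] * h[j] for j in range(n_hid)) + b2[k]
             for k in range(n_out)]
        E = sum((T[k] - O[k]) ** 2 for k in range(n_out))
        history.append(E)
        if E <= target_error:
            break
        # Backward pass (computed with the weights used in the forward pass).
        dE_dO = [-2.0 * (T[k] - O[k]) for k in range(n_out)]
        dE_dh = [sum(dE_dO[k] * W2[k][j] for k in range(n_out))
                 for j in range(n_hid)]
        dE_dP = [dE_dh[j] * h[j] * (1.0 - h[j]) for j in range(n_hid)]
        # Gradient descent update: W' = W - a * gradient.
        for k in range(n_out):
            for j in range(n_hid):
                W2[k][j] -= learning_rate * dE_dO[k] * h[j]
            b2[k] -= learning_rate * dE_dO[k]
        for j in range(n_hid):
            for i in range(n_in):
                W1[j][i] -= learning_rate * dE_dP[j] * x[i]
            b1[j] -= learning_rate * dE_dP[j]
    return history

# Hypothetical starting values for a 3-2-2 network (not the article's numbers).
x = [1.0, 0.5, -1.0]
T = [1.0, 0.0]
W1 = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
b1 = [0.1, 0.2]
W2 = [[0.7, 0.8], [0.9, 1.0]]
b2 = [0.3, 0.4]

history = train(x, T, W1, b1, W2, b2)
print(len(history), history[0], history[-1])
```

Training on a single repeated input mirrors the walkthrough here; a real training run would loop over many samples.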

So this is how a neural network “learns” in general. If I have more free time (and a good mood, of course), I will share the source code of a multilayer perceptron (another name for the “ordinary neural network” that is our focus here) in Python using NumPy. See you.

Another neural network series by me:

  1. How Neural Network Process Your Input (Trained Neural Network)
  2. How Neural Network “Learn”

Source: Deep Learning on Medium