Source: Deep Learning on Medium

And Its Learning Rules Derived From Scratch

In my previous post I introduced the structure and the function of a perceptron — a building block of every neural network. I concluded my post by bringing your attention to the fact that, while I defined most variables and functions that go into the perceptron, I didn’t mention anything about what happens to the weights. Well, the weights themselves and how they’re set or changed are actually at the heart of what it means to ** train a perceptron**. My goal here is explain how a single perceptron can be trained. If you understand how this works on a single perceptron, then you should be able to build up a solid intuition about how a neural network with a more complex architecture gets trained in general.

In case you’d like a refresher or have not seen my post on the structure and function of the perceptron, check it out in this article below even though I’ll quickly summarize the main points before moving forward into training a perceptron.

**The Perceptron — The Building Block of Neural Networks**

*And How It Can Compute A Logical Statement*blog.usejournal.com

NOTE: Some experience in calculus is recommended for this post.

#### A QUICK REFRESHER ON THE PERCEPTRON

This diagram is taken directly from my previous article:

Here is a recap of what’s illustrated in the diagram:

- This diagram is of a binary classifier perceptron that can compute simple logical statements.
- The activity function A is a weighted sum of the input variables x and some bias theta. It defines the line or the linear boundary which can bisect the input space in a way such that the binary data points can be separated. The property of a straight line being able to separate such points in space is called
*linear separability.* - The
term determines the vertical placement of the line in the input space.*bias* - The
is a function that takes the activity value as its argument (the output of*activation function f*) and maps it to either a 0 or a 1 to ensure that the final output of the perceptron is binary.*activity function A*

#### SOME CONVENIENT NOTATION

Let’s introduce a slightly refined and generalized representation of the perceptron above:

I’m adopting this new notation to make the upcoming parts of this post easy to understand. Here we have the following new notations:

*w_ij —*the weight that links input*x_i*to perceptron*j*(the node in the middle of the diagram).*y_j —*the output of perceptron*j*.

Now let’s dive right into training a perceptron.

#### WEIGHTS — THE THINGS THAT THE PERCEPTRON LEARNS

Recall that the components that make up a perceptron are the inputs, the weights, a bias, the activity function and the activation function, while the output is a final quantity that depends on all of these things. The inputs are fixed — they are the data! We can’t make any changes to the data. The activity function and the activation function are just that — functions, not quantities to be changed around either. So the only thing that we actually have control over to change around when training a perceptron are ** the weights**. This means that …

…when we train a perceptron, we make changes to the weights iteratively until we find the best set of weights so that the activity function produces the correct slope and orientation of a line in order to take advantage of linear separability.

#### SUPERVISED LEARNING — THE PREREQUISITE FOR TRAINING

Let’s think of a perceptron in the context of supervised learning. Imagine that we have a some label for what the output y_j of perceptron j needs to be: let’s call this our ** desired output d_j**. This desired output d_j is the ground truth, the target output we know the perceptron should be producing. In other words, this desired output

**— thus the term supervised learning.**

*supervises our perceptron in what weights it should learn in order to produce a certain output*If we know what the desired output is, we can use some math to our advantage to define some function that will allow us to compare the desired output d_j to the ** actual output y_j **that the perceptron produced. But we also want this function to be dependent on the weights since the weights are what we want to update when training a perceptron. Let’s call this function an

**because if the actual output is not equal to of the desired output, then the value of the error function is non-zero and we can say that the perceptron has made an error! I’m calling it an error function here to stick with intuitive language but this is what is conventionally referred to as the**

*error function***.**

*loss function*Our goal is then to find the minimum of this function — as error is something we wish to minimize.

We can define some function that depends on the weights and which will allow us to relate the actual output of the perceptron to some desired output. This function is the error function and the goal is to minimize it by iteratively updating the weights.

Once the error is minimized, we can they say the perceptron has been trained.

#### THE STEPS OF TRAINING THE PERCEPTRON

#### 1. Initialize the Weights and Calculate the Actual Output

Let’s look at the perceptron again:

The input x_i is multiplied by a randomly initialized weight w_ij and is fed into the perceptron along with a bias and all other weighted inputs. Inside the perceptron the activity function is applied to this weighted sum of inputs plus a bias and its value is then fed as an argument into the activation function f. The value of the activation function produces the output y_j of perceptron j.

#### 2. Calculate the Error

Before we can calculate the error, we first need to define the error. Remember that I mentioned that the error function has to be dependent on the weights and it needs to relate actual output y_j to desired output d_j. So let’s define it as this:

This function e_j does relate d_j to y_j and is in fact dependent on the weights because we know the y_j term itself is dependent on the weights. We know that we want to minimize e_j. But depending on the values of d_j and y_j, we might get a negative value for the error e_j. So let’s square it to ensure that the error is always a positive number and define it as

So then, we’ve redefined the error function e_j as E_j to ensure that it is never negative. But I’ve also included a factor of 1/2. As one of my favorite professors used to say, ‘we do what is convenient in mathematics’. The factor of 1/2 is there to make things more convenient later. I will come back to this shortly.

#### 3. Gradient Descent — Updating the Weights to Further Reduce Error

Now that we have an error function, we can use it to help us determine how to update w_ij so we can reduce the error further. Let’s define some mathematical expression for what an updated weight looks like based on current the current and the error function E_j:

The equation in the box is an expression for finding an ** updated weight w_ij(k+1) **using

- the
*current weight w_ij(k)* - some
given by the greek letter*step size**eta* - the
*derivative of the error function E_j with respect to w_ij.*

This is, in other words, the formula of ** gradient descent — **an iterative method of finding the minimum of a function.

Gradient Descent — given an initial starting point w_ij(k) we want to move in a direction of steepest descent which can be calculated by the derivative of the error function scaled by some amount eta — thus arriving to some new point w_ij(k+1) which corresponds to the point w_ij(k) either reduced or increased by some amount that is equivalent to the quantity eta * derivative of E_j.

#### Why the Partial Derivative?

A question arises: Why have I indicated the derivative of the error function with a ** partial derivative**? Well, the graph above is a graph of the error as a function of a

**. In more realistic cases a perceptron will have multiple weights. If we had 2 weights for example, the the graph of the error function would be a 3 dimensional surface in 3 dimensional space, thus two partial derivatives of the error function would need to be calculated with respect to both weights. More generally, if we had n weights, then the error function would be an n+1 dimensional hyper-surface in n+1 dimensional space and its derivative would be calculated with the help of a**

*single weight***with n components — each representing a partial derivative of the error function in the direction of a single weight.**

*gradient vector*But let’s go back to the expression of an updated weight w_ij(k+1). From this we can actually derive the quantity by which we need to change the current weight during the next iteration (either increase it or decrease it by some amount) in order to further reduce E_j. This quantity of change to the current weight is denoted by the term ** delta w_ij **in the box below:

The expression in the box above is an application of the method of gradient descent in the context of the perceptron where we try to modify the weights in order to reduce the value of the error function.

#### Derivative of the Error Function

So how do we calculate the derivative of the error function? To calculate this derivative we have to acknowledge that the error function E_j is pretty complicated ** composite function** composed of function e_j which in turn is composed of yet another function y_j, etc. More specifically,

thus, we need to make use of the ** chain rule **to take the derivative of E_j with respect to the weight w_ij.

Next, we need to compute each of the four partial derivative terms on the rights side of the above equation, but that’s not a very difficult task given that we know E_j, e_j, y_j and A_j:

Now that we have each of those terms computed, we can plug them back into the chain rule equation for the derivative of E_j and we will get the following expression:

#### THE PERCEPTRON DELTA FUNCTION — Putting it all together

Again, in the above expression recall that the subscripts i_j for the weight indicate that this is the weight linking input i to perceptron j. The partial derivative of E_j here is in the direction of the weight w_ij only, and therefore this derivative is only one of the elements of the gradient vector of of E_j for perceptron j. Notice all the terms excluding x_i have the same subscript j indicating they all relate to output of perceptron j!

Let’s simplify even further by grouping the terms that share the same subscripts as follows:

And plugging this back into the expression for ** delta w_ij **we get

The expression in the box is the ** perceptron delta function** —

*a function that quantifies by how much a current weight w_ij(k) needs to be changed in order to obtain the updated weight w_ij(k+1).*

theperceptron delta functionquantifies by how much a current weight w_ij(k) needs to be changed in order to obtain the updated weight w_ij(k+1) which will contribute to the further reduction of the error.

Iteratively repeating this process until the error is minimized is the very process we call training the perceptron.

#### THE MULTIVARIATE CASE

The computations above were done for a single perceptron j that contributes to the output y_j using a single weight w_ij. If we had m perceptrons contributing to some output y_j (that is m perceptrons in the output layer of a neural network for example) then the formula for the error function would be given as

Furthermore, if we want to account for all n inputs with the n weights that link them to every perceptron, then for each perceptron j in the above formula, we would need to calculate n updated weights using the perceptron delta function n times, and thus n partial derivatives of E_j that are components of the gradient vector of E_j. Thus you can see how this quickly becomes incredibly complex as soon as we start accounting for more weights and more perceptrons — attesting to the computational complexity that neural networks are a quite notorious for.

*RECAPPING ALL THE MATH MESS*

*RECAPPING ALL THE MATH MESS*

is an optimization problem which involves iteratively updating the weights in a way that minimizes the error function.*Training a perceptron*- We derived the
and defined what an*error function*should be based on a current weight and the error calculated at the current iteration. We did this all using an example of a single weight and a single perceptron.*updated weight* - We then derived
which quantifies the change that needs to be added to or subtracted from the current weight in order to arrive at the updated weight that further reduces the error.*the perceptron delta function*

For my next post, I will build on the mathematical formulations derived herein to do a walkthrough of the ** feed forward back propogtation algorithm** as applied to an example of a simple multi-layered perceptron (MLP). ___________________________________________________________________

#### SIDENOTE — Why Use Gradient Descent and Not Set the Derivative to Zero to Find the Minimum?

A question arises as to why not simply minimize the error function by setting its derivative to zero. Well, if the number of weights gets large, then the error function quickly becomes very complex. Let’s say the graph below is a slice or a cross sectional view of some 3 dimensional (three weights) error function that makes up a surface.

As you can see the error function is very complex! The problems with such complex functions are :

1. we have many minima

2. such functions are too complicated to be expressed mathematically in a neat formula

3. taking the derivative of such complex functions may not even be feasible without a neat mathematical formula

4. finding the arguments (the optimal weights) that correspond to the global minima is not realistic.

Minimizing a function by setting its derivative to zero is an analytic approach to solving optimization problems, but analytic solutions aren’t always accessible to us. Therefore we can use an alternative method that is computational — a numerical approximation to this analytic method. That’s what gradient descent is — an iterative method of finding the minimum of a function by randomly initializing some weights from which we can move forward or backwards along the surface in the direction of the steepest descent until we get to the weights that correspond to the minimum of the surface.

- This content was adopted from
*Rojas, Raul (1996): Neural Networks: A Systematic Introduction. Berlin: Springer-Verlag Publishers. Ch 4.1–4.2.5**and from lecture notes of Dr. M. Fleischer — professor at Johns Hopkins University, Neural Networks Introduction graduate course.*