Understanding Neural Networks with High School Math

Original article was published on Deep Learning on Medium

Are you a high school student who wants to learn about neural networks but doesn’t understand the crazy, complicated math? Well, let me help you…

Introduction

Neural networks are the most recognized and well-known algorithm associated with Artificial Intelligence right now. Whether or not this claim is true, many students that I know hear this term before learning about Support Vector Machines, Naive Bayes, or Logistic Regression. They get fascinated with the “brain-like” structure of the network and its ability to classify images of cats and dogs, generate handwritten digits, filter e-mail, and even predict the future😕?

Although some of the problems that researchers, companies, and the media claim neural networks can solve are implausible, neural networks continue to become more and more robust and practical every day. Due to this, students interested in Artificial Intelligence first start learning by developing a basic neural network to solve a common problem, such as the XOR problem.

However, many students struggle and fail in the beginning because they lack the mathematical knowledge that is so critical to truly understanding how neural networks work. Students, often those that are in high school or starting college, get overwhelmed with the convoluted notations, derivations, and explanations.

Here is an example from Andrew Ng’s lecture that can be hard to understand, especially for high school students

Well, I’m here to tell you that there is hope! Although many lectures, articles, and videos use Multivariable Calculus and Linear Algebra to explain neural networks, High School Calculus and Elementary Algebra are enough for you to understand what is going on and move forward in your Artificial Intelligence education.

In this article, I’m going to give you a straightforward explanation of the math behind neural networks. I’m going to break this up into 4 parts:

  1. Creating the Network
  2. Forward Propagation
  3. Backpropagation
  4. Testing it out using code

Prerequisites

In order to understand the math, you must know high school calculus. This means that you finished AP Calculus AB/BC or know Calculus I/II.

You also need to know common terminology, like weights, backpropagation, and convergence.

Starting Simple

The first step to learning about the vast field of Artificial Intelligence and Neural Networks is to start simple.

We will be using a very basic neural network architecture consisting of 1 input layer with 2 neurons, 1 hidden layer with 2 neurons, and 1 output layer with 1 neuron. Don’t worry about the letters like h, a, b, or o. All you need to know is that the inputs are represented by i₁ and i₂, the weights are represented by w₁, w₂, w₃, w₄, w₅, and w₆, and the output is represented by o₁.

The activation function we will be using is the sigmoid function. If you don’t know what an activation function is, it is a way to keep the neural network non-linear. Without activation functions, the neural network could perform only linear mappings from the inputs to the output.

The sigmoid function may look familiar to those who remember logistic growth from Calculus II. A sigmoid function is just a logistic function where the carrying capacity is 1 and the inflection point is at (0, 0.5).
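
To make the shape concrete, here is a minimal Python sketch of the sigmoid function. It assumes the standard definition σ(x) = 1 / (1 + e^(−x)); the function name `sigmoid` is chosen here for illustration.

```python
import math

def sigmoid(x):
    """Logistic function with a carrying capacity of 1: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# The inflection point sits at (0, 0.5).
print(sigmoid(0))    # 0.5

# Large positive inputs approach the carrying capacity of 1,
# and large negative inputs approach 0.
print(sigmoid(10))
print(sigmoid(-10))
```

No matter how large or small the input gets, the output stays strictly between 0 and 1.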

Moving Forward

We will start by multiplying the inputs with their corresponding weights.

think of h₁ and h₂ as two functions

Essentially, every input neuron connects to every neuron in the hidden layer. For every neuron in the hidden layer, multiply each input neuron connected to it by the weight of the link between them, and then sum up those products.

Next, we have to apply the activation function to these sums.

This is basically a composite function (a function within a function)

Then, we apply the same logic as we move from the hidden layer to the output layer.

Once again, think of b₁ as a function
You can’t forget to apply the sigmoid function!

Putting it all together, this is what forward propagation looks like in one line:

Notice that this is just a very complex composite function. Also, if you removed σ, then this would just be a linear function.
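
Written out, the forward pass described in this section might look like the following. This is a sketch under one assumption: the network diagram is what fixes which weight sits on which link, so the pairing used here (w₁ and w₂ feeding h₁, w₃ and w₄ feeding h₂) is assumed for illustration. The roles of h, a, b, and o follow the way those letters are used in the derivation later in the article.

```latex
\begin{aligned}
h_1 &= i_1 w_1 + i_2 w_2, & h_2 &= i_1 w_3 + i_2 w_4 \\
a_1 &= \sigma(h_1), & a_2 &= \sigma(h_2) \\
b_1 &= a_1 w_5 + a_2 w_6 \\
o_1 &= \sigma(b_1)
     = \sigma\bigl(w_5\,\sigma(i_1 w_1 + i_2 w_2) + w_6\,\sigma(i_1 w_3 + i_2 w_4)\bigr)
\end{aligned}
```

Dropping every σ from the last line leaves only sums of products of inputs and weights, which is why the network would collapse into a linear function of the inputs.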

Moving Backward

The whole goal of a neural network is to take some inputs and match them to a certain output. This is accomplished through backpropagation. We first figure out how incorrect the network is using an error function. Next, we evaluate how much each weight affects the error of the network. Finally, we use gradient descent to adjust the weights and minimize the error. After repeating these 3 steps many times, we will eventually converge upon a neural network whose output for the given inputs closely matches the target output.

The math behind backpropagation is done using partial derivatives, which require multivariable calculus😢. However, you can understand the derivation by using only high school calculus😃.

The first step is to define the error function:

This is called the sum of squared errors. The 1/2 is in front because it makes the derivation easy: it cancels the 2 that the power rule brings down.
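
For the single output in this network, the error function can be sketched as below. The symbol names are assumptions for illustration: E is the error, actual is the target value, and o₁ is the network’s prediction.

```latex
E = \tfrac{1}{2}\,(\text{actual} - o_1)^2
\qquad\Longrightarrow\qquad
\frac{dE}{do_1} = -(\text{actual} - o_1) = o_1 - \text{actual}
```

The derivative on the right is where the 1/2 pays off: the power rule produces a factor of 2, and the two cancel.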

Next, we define the formula that we will be using. Essentially, we will be moving in the direction of steepest descent as defined by the negative of the gradient.

The ∂ symbol plays the same role as the d symbol for derivatives, except it signifies that you are taking a partial derivative. A partial derivative is just a regular derivative except that there is more than one variable in the function. For example, the function might be f(x,y) = x² + y². If you are taking the partial derivative with respect to x, you treat all other variables (in this case, y) as constants (like the number 1, 2, 3, or C), so the partial derivative of f with respect to x is 2x.

Also, the α in front of the gradient is called the learning rate. In order to take small steps, we need to define a small learning rate.
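
As a small illustration of the update rule, the sketch below runs gradient descent on the example function f(x, y) = x² + y². It assumes the usual form of the rule (new value = old value − α × partial derivative); the starting point, learning rate, and number of steps are arbitrary choices made here, not values from the article.

```python
def f(x, y):
    return x**2 + y**2

def df_dx(x, y):
    # Partial derivative with respect to x: y is treated as a constant.
    return 2 * x

def df_dy(x, y):
    # Partial derivative with respect to y: x is treated as a constant.
    return 2 * y

alpha = 0.1        # learning rate (arbitrary choice)
x, y = 3.0, -4.0   # starting point (arbitrary choice)

for step in range(100):
    # Evaluate both partial derivatives at the current point first,
    # then step against the gradient.
    gx, gy = df_dx(x, y), df_dy(x, y)
    x -= alpha * gx
    y -= alpha * gy

print(x, y, f(x, y))
```

Each step moves the point a little closer to (0, 0), where f has its minimum. A much larger learning rate would make each step overshoot the minimum instead.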

Also, here is a refresher on some of the variables that we will be using:

So, for finding the gradient for w₅ and w₆:

Here, you can see that the left side equals the right side after crossing out the numerators and denominators on the right side. This is an example of the chain rule because we are trying to take the derivative of a function (which is error) with respect to a variable (which is w₅) when w₅ isn’t a part of the function (which is error). So, we take the derivative with respect to a variable found in the error function (which is o₁) and then we do the same for another variable until we finally reach a function that contains w₅ (which is b₁). Then, we take the derivative of b₁ with respect to w₅ to find the gradient.
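
In symbols, the chain described in this paragraph can be written as follows, where E stands for the error (the letter E is a label assumed here):

```latex
\frac{\partial E}{\partial w_5}
  = \frac{\partial E}{\partial o_1}\cdot
    \frac{\partial o_1}{\partial b_1}\cdot
    \frac{\partial b_1}{\partial w_5}
```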

In order to actually solve this equation, we need to substitute in the values for error, o₁, and b₁.

Now, simplifying this equation may seem daunting, but all it requires is basic knowledge of derivatives. For the first part, you can use the chain rule and power rule to simplify the error. All you have to do is treat actual as a constant value. For the second part, you may not know how to take the first derivative of the sigmoid function. However, it has been proven in the past that σ′(x) = σ(x)(1 − σ(x)).

So, all you have to do is treat b₁ as x and use the first derivative above. Finally, for the third part, you may remember the formula for b₁ from before we started this derivation. All you have to do is treat w₅ as x, treat the rest of the variables as constants, and take the first derivative with respect to x. If you do, b₁ has the form ax + C, so the first derivative of b₁ is a.
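
If you would rather check the sigmoid derivative than take it on faith, a quick numerical comparison works. The sketch below assumes the standard identity σ′(x) = σ(x)(1 − σ(x)) and compares it with a central finite-difference slope; the helper names and the step size are chosen here for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    # Closed form: sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def numerical_slope(func, x, step=1e-6):
    # Central difference: slope of the secant line through x - step and x + step.
    return (func(x + step) - func(x - step)) / (2 * step)

for x in (-2.0, 0.0, 0.5, 3.0):
    print(x, sigmoid_derivative(x), numerical_slope(sigmoid, x))
```

The two columns should agree to many decimal places at every test point.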

Putting all of this together:

Similarly, you can repeat this same process for finding the gradient for w₆:
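
Carrying out the three derivatives gives the results sketched below. They assume the error has the form E = ½(actual − o₁)², that o₁ = σ(b₁), and that b₁ = a₁w₅ + a₂w₆, which is consistent with the steps in this section but uses symbol names chosen here.

```latex
\begin{aligned}
\frac{\partial E}{\partial w_5} &= (o_1 - \text{actual})\cdot o_1(1 - o_1)\cdot a_1 \\
\frac{\partial E}{\partial w_6} &= (o_1 - \text{actual})\cdot o_1(1 - o_1)\cdot a_2
\end{aligned}
```

The first two factors are identical in both lines; only the last factor changes, because w₅ multiplies a₁ and w₆ multiplies a₂.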

Now, the next step is to move backwards one layer and find the gradients for the next set of weights: w₁, w₂, w₃, and w₄. The process is the same as above but you have to take more partial derivatives before reaching the weights.

gradient for w₁
Everything simplifies out

Just like before, we plug in the values for error, o₁, b₁, a₁, and h₁, and then solve using power rule, chain rule, and what we already know.
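
For w₁, the longer chain and its result can be sketched as follows. This assumes w₁ sits on the link from i₁ into h₁ (so that h₁ = i₁w₁ + i₂w₂), along with E = ½(actual − o₁)², a₁ = σ(h₁), b₁ = a₁w₅ + a₂w₆, and o₁ = σ(b₁). The network diagram is what fixes the actual wiring, so treat that pairing as an assumption.

```latex
\begin{aligned}
\frac{\partial E}{\partial w_1}
  &= \frac{\partial E}{\partial o_1}\cdot
     \frac{\partial o_1}{\partial b_1}\cdot
     \frac{\partial b_1}{\partial a_1}\cdot
     \frac{\partial a_1}{\partial h_1}\cdot
     \frac{\partial h_1}{\partial w_1} \\
  &= (o_1 - \text{actual})\cdot o_1(1 - o_1)\cdot w_5\cdot a_1(1 - a_1)\cdot i_1
\end{aligned}
```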

In sum, here are the gradients for all of the weights:

Notice the Δ; it makes it easier to write out the equation.
Also, notice how some of the calculations are repeated for all of the weights.
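
One way to gain confidence in gradients like these is to compare them against slopes measured numerically. The sketch below does that under stated assumptions: the wiring (w₁ and w₂ feed h₁, w₃ and w₄ feed h₂), the error E = ½(actual − o₁)², and every input and weight value are hypothetical choices made here, not taken from the article. The weights are packed into a tuple only so that each one can be nudged in turn.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(i1, i2, w):
    # w = (w1, ..., w6); the wiring below is an assumption.
    w1, w2, w3, w4, w5, w6 = w
    a1 = sigmoid(i1 * w1 + i2 * w2)
    a2 = sigmoid(i1 * w3 + i2 * w4)
    o1 = sigmoid(a1 * w5 + a2 * w6)
    return a1, a2, o1

def error(i1, i2, w, actual):
    _, _, o1 = forward(i1, i2, w)
    return 0.5 * (actual - o1) ** 2

def gradients(i1, i2, w, actual):
    w1, w2, w3, w4, w5, w6 = w
    a1, a2, o1 = forward(i1, i2, w)
    delta = (o1 - actual) * o1 * (1 - o1)  # shared by all six gradients
    d1 = delta * w5 * a1 * (1 - a1)        # shared by w1 and w2
    d2 = delta * w6 * a2 * (1 - a2)        # shared by w3 and w4
    return (d1 * i1, d1 * i2, d2 * i1, d2 * i2, delta * a1, delta * a2)

def numerical_gradients(i1, i2, w, actual, step=1e-6):
    # Nudge one weight at a time and measure how the error changes.
    slopes = []
    for k in range(len(w)):
        up, down = list(w), list(w)
        up[k] += step
        down[k] -= step
        rise = error(i1, i2, tuple(up), actual) - error(i1, i2, tuple(down), actual)
        slopes.append(rise / (2 * step))
    return tuple(slopes)

# Hypothetical values, for illustration only.
inputs = (1.0, 0.5)
weights = (0.2, -0.4, 0.3, 0.5, -0.1, 0.6)
target = 1.0

analytic = gradients(*inputs, weights, target)
numeric = numerical_gradients(*inputs, weights, target)
for name, a, n in zip(("w1", "w2", "w3", "w4", "w5", "w6"), analytic, numeric):
    print(name, a, n)
```

The `delta` term is computed once and reused for every weight, which is the repeated calculation worth noticing.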

Testing It Out

Even though all of this may seem to be true, we can’t be 100% sure until we have applied this math. So, in order to make sure that these calculations are correct, we will be testing out the neural network with some dummy data.

Our inputs, outputs, and initial weights will be:

Here, I’ve defined the same in Python 3.x, which I will be using to test this out.

Also, here is the sigmoid function that we will be using:

Now, here is the forward propagation:

Here is the backpropagation:

Notice that I am not using lists. This is so you can understand the code.

Here is the loop that will call both methods. We will be doing this for 100,000 epochs:

We first do forward propagation, print the results, and then do backpropagation
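
Pulling the pieces of this section together, a complete script might look like the sketch below. It is not the article’s original code. The inputs, initial weights, and learning rate are hypothetical values chosen here, the wiring (w₁ and w₂ feed h₁, w₃ and w₄ feed h₂) is assumed, and only the target of 1 and the 100,000 epochs come from the text. Plain variables are used instead of lists to keep each step visible.

```python
import math

# Hypothetical dummy data: illustrative values, not the article's own.
i1, i2 = 0.1, 0.2                       # inputs
actual = 1.0                            # target output
w1, w2, w3, w4 = 0.3, -0.2, 0.4, 0.1    # input -> hidden weights
w5, w6 = -0.5, 0.25                     # hidden -> output weights
alpha = 0.5                             # learning rate (assumed)
epochs = 100_000

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w1, w2, w3, w4, w5, w6):
    # Assumed wiring: w1 and w2 feed h1, w3 and w4 feed h2.
    a1 = sigmoid(i1 * w1 + i2 * w2)
    a2 = sigmoid(i1 * w3 + i2 * w4)
    o1 = sigmoid(a1 * w5 + a2 * w6)
    return a1, a2, o1

def backward(a1, a2, o1, w1, w2, w3, w4, w5, w6):
    # Assumes the error is 0.5 * (actual - o1) ** 2.
    delta = (o1 - actual) * o1 * (1 - o1)
    d1 = delta * w5 * a1 * (1 - a1)
    d2 = delta * w6 * a2 * (1 - a2)
    # Every gradient is computed from the old weights before any update.
    return (w1 - alpha * d1 * i1, w2 - alpha * d1 * i2,
            w3 - alpha * d2 * i1, w4 - alpha * d2 * i2,
            w5 - alpha * delta * a1, w6 - alpha * delta * a2)

first_prediction = forward(w1, w2, w3, w4, w5, w6)[2]

for epoch in range(epochs):
    a1, a2, o1 = forward(w1, w2, w3, w4, w5, w6)
    if epoch % 10_000 == 0:
        print(f"epoch {epoch}: prediction = {o1:.6f}")
    w1, w2, w3, w4, w5, w6 = backward(a1, a2, o1, w1, w2, w3, w4, w5, w6)

final_prediction = forward(w1, w2, w3, w4, w5, w6)[2]
print(f"final prediction = {final_prediction:.6f}")
```

With these values the prediction should climb toward the target of 1 as training goes on, although the exact numbers will differ from the article’s, since the starting values are not the same.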

And, here are our results!

Near the beginning, we are far from our target prediction of 1
However, we soon converge upon a close approximation of 1

Conclusion

Congrats, you just learned the math behind neural networks! Using high school calculus and elementary algebra, you should now be able to apply the notations, derivations, and explanations learned in this article to more complex neural network structures.

Going forward, I recommend that you do the following:

  • Learn how to apply Linear Algebra to this problem by creating a vectorized approach to forward and back propagation. By using matrices and vectors for the math, you can write less code and use more data to train a neural network.
  • Expand upon your understanding of Gradient Descent. The different variants of gradient descent are defined by how much of the data is used to calculate the derivative of the cost function at each step. Examples are Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent.
  • Start creating your own neural networks. Apply what you learned in this article to your own projects. See if you can use the math from this article to build, from scratch, a neural network that can solve the XOR problem.