# Building a 3 Layer Neural Network from Scratch

In this post, I will walk through the steps of building a 3 layer Neural Network, working through a concrete problem and explaining the most important concepts along the way.

#### The Problem to solve

A farmer in Italy was having a problem with his labelling machine: it mixed up the labels of three wine cultivars. Now he has 178 bottles left and nobody knows which cultivar made them! To help this poor man, we will build a classifier that recognizes the wine based on 13 attributes of the wine.

The fact that our data is labeled (with the three cultivars) makes this a supervised learning problem. Essentially, we want to take our input data – the 178 wine bottles – put it through our NN and get the right cultivar label for each bottle as output. We train our algorithm to make better and better predictions (y-hat) of which bottle belongs to which label.

Now it is time to start building the Neural Network!

#### Approach

Building a Neural Network is almost like building a very complicated function, or a very difficult recipe. In the beginning the ingredients or steps you will have to take can seem overwhelming, but if you break everything down and do it step by step, you will be fine.

In short:

• The input layer (x) consists of 13 neurons, one for each wine attribute (the 178 bottles are our training samples, not neurons).
• A1, the first layer, consists of 8 neurons.
• A2, the second layer, consists of 5 neurons.
• A3, the third and output layer, consists of 3 neurons.

#### Step 1- The Usual Prep

Import all necessary libraries (NumPy, scikit-learn, pandas, Matplotlib), load the dataset, and define x and y.

```python
# Package imports
import pandas as pd
import numpy as np

# Matplotlib for plotting
import matplotlib
import matplotlib.pyplot as plt

# scikit-learn is a machine learning utilities library
import sklearn
# The sklearn dataset module helps with generating datasets
import sklearn.datasets
import sklearn.linear_model
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score

# Importing the dataset
df = pd.read_csv('../input/W1data.csv')
df.head()
```
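The snippet above stops short of defining x and y. The labels come as one of three cultivars, so the targets need one-hot encoding before they can be compared with the 3-neuron output layer. A minimal, self-contained sketch – the `labels` array here is a made-up stand-in for the dataset's label column, and sklearn's `OneHotEncoder` would do the same job as the identity-matrix lookup:

```python
import numpy as np

# Hypothetical stand-in for the cultivar labels in W1data.csv; the real
# notebook would take these from the DataFrame's label column.
labels = np.array([0, 2, 1, 0, 2])

# One-hot encode the three cultivars so each row can be compared directly
# against the 3-neuron softmax output layer.
y = np.eye(3)[labels]

print(y.shape)  # (5, 3): one row per bottle, one column per cultivar
```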

#### Step 2- Initialisation

Before we can use our weights, we have to initialise them. Because we don't have values to use for the weights yet, we use random values between 0 and 1. In Python, NumPy's random module generates these 'random numbers'. However, they are not truly random: they are pseudorandom, produced by a formula that makes the output look random. To generate each number, the formula takes the previously generated value as its input; if no value has been generated yet, it often takes the current time as a starting point.

That is why we seed the generator: to make sure that we always get the same sequence of 'random' numbers. We provide a fixed value for the generator to start with, which is zero in this case.

`np.random.seed(0)`
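The `initialise_parameters` function called later in the post is never shown. A sketch of what it could look like, assuming the layer sizes listed earlier (13 → 8 → 5 → 3), weights drawn uniformly between 0 and 1, and biases starting at 0 – the function name and signature here are assumptions:

```python
import numpy as np

np.random.seed(0)

def initialise_parameters(nn_input_dim, nn_hdim1, nn_hdim2, nn_output_dim):
    # One weight matrix and one bias vector per layer; weights are
    # random values in [0, 1), biases start at 0, as described above.
    model = {
        'W1': np.random.rand(nn_input_dim, nn_hdim1),
        'b1': np.zeros((1, nn_hdim1)),
        'W2': np.random.rand(nn_hdim1, nn_hdim2),
        'b2': np.zeros((1, nn_hdim2)),
        'W3': np.random.rand(nn_hdim2, nn_output_dim),
        'b3': np.zeros((1, nn_output_dim)),
    }
    return model

model = initialise_parameters(13, 8, 5, 3)
print(model['W1'].shape)  # (13, 8)
```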

#### Step 3- Forward propagation

Training a Neural Network consists of roughly two parts. First, you propagate forward through the NN: you 'make steps' forward and compare the result with the real values to get the difference between your output and what it should be. You basically see how the NN is doing and find the error.

After we have initialised the weights with a (pseudo)random number, we take a linear step forward. We calculate this by taking the dot product of our input A0 and the randomly initialised weights, and adding a bias. We start with a bias of 0. This is represented as z1 = A0 · W1 + b1.

Now we take z1 (our linear step) and pass it through our first activation function. Activation functions are very important in Neural Networks. Essentially, what they do is convert an input signal to an output signal, which is why they are also known as transfer functions. They introduce non-linear properties to our functions by converting the linear input to a non-linear output, making it possible to represent more complex functions.

There are different kinds of activation functions, which are explained in depth in this article. For this model, we chose the tanh activation function for our two hidden layers, A1 and A2, which gives an output value between -1 and 1. Since this is a multiclass classification problem (we have 3 output labels), we use the softmax function for the output layer, A3, because it computes the probabilities for the classes by outputting values between 0 and 1 that sum to 1.
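The forward-pass code below calls a `softmax` function that the original snippet never defines (the hidden layers just use `np.tanh`). A common, numerically stable sketch:

```python
import numpy as np

def softmax(z):
    # Shift each row by its max before exponentiating to avoid overflow,
    # then normalise so every row sums to 1 -- class probabilities.
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

scores = np.array([[2.0, 1.0, 0.1]])
probs = softmax(scores)
print(probs)  # three positive values summing to 1, largest for the 2.0 logit
```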

By passing z1 through the activation function, we have created our first hidden layer -A1-, which can be used as input for the computation of the next linear step, z2.

In python this process looks like this:

```python
# This is the forward propagation function
def forward_prop(model, a0):
    # Load parameters from model
    W1, b1, W2, b2, W3, b3 = model['W1'], model['b1'], model['W2'], model['b2'], model['W3'], model['b3']

    # Do the first linear step
    z1 = a0.dot(W1) + b1
    # Put it through the first activation function
    a1 = np.tanh(z1)

    # Second linear step
    z2 = a1.dot(W2) + b2
    # Put it through the second activation function
    a2 = np.tanh(z2)

    # Third linear step
    z3 = a2.dot(W3) + b3
    # For the third activation we use the softmax function
    a3 = softmax(z3)

    # Store all results in the cache
    cache = {'a0': a0, 'z1': z1, 'a1': a1, 'z2': z2, 'a2': a2, 'z3': z3, 'a3': a3}
    return cache
```

In the end, all our values are stored in the cache.

#### Step 4- Backwards propagation

After we forward propagate through our NN, we backward propagate our error gradient, to update our weight parameters. We know our error and want to minimize it as much as possible. We do this by taking the derivative of the error function with respect to the weights (W) of our NN, using gradient descent.

Let's visualise this process with an analogy.

Imagine you went out for a walk in the mountains during the afternoon, but now it is an hour later and you are getting hungry, so it's time to go home. The only problem is that it is dark and there are many trees, so you can see neither your home nor where you are. Oh, and you forgot your phone at home.

But then you remember your house is in a valley, the lowest point in the whole area. So if you just walk down the mountain step by step until you don't feel any slope any more, in theory you should arrive at your home.

So there you go, step by step carefully going down. Now think of the mountain as the loss function, you are the algorithm, trying to find your home (i.e. the lowest point). Every time you take a step downwards, we update your location coordinates (the algorithm updates the parameters).

In this picture, the loss function is the mountain, and to get to a low loss the algorithm follows the slope – that is, the derivative – of the loss function. Walking down the mountain corresponds to updating our location coordinates: the algorithm updates the parameters of the neural network. By getting closer to the minimum point, we approach our goal of minimising the error.
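The walk downhill can be sketched in one dimension. This toy example is not part of the wine model; the 'mountain' here is a made-up parabola f(w) = (w - 3)², whose lowest point sits at w = 3:

```python
# Gradient descent on a toy 1-D "mountain": f(w) = (w - 3)**2,
# whose minimum (the "house in the valley") sits at w = 3.
w = 0.0                        # where we start on the mountain
learning_rate = 0.1
for _ in range(100):
    grad = 2 * (w - 3)         # the slope under our feet
    w -= learning_rate * grad  # step against the slope, i.e. downhill
print(round(w, 4))             # 3.0 -- we arrived at the lowest point
```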

In reality, gradient descent looks more like this:

We always start by calculating the slope of the loss function with respect to z, the linear step we take.

Notation: dv denotes the derivative of the loss function with respect to a variable v.

Next, we calculate the slope of the loss function with respect to our weights and biases. Because this is a 3-layer NN, we iterate this process for z3, z2, z1 together with W3, W2, W1 and b3, b2, b1, propagating backwards from the output to the input layer.
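The backward-pass code below relies on two helpers, `loss_derivative` and `tanh_derivative`, that the original snippet does not define. Assuming a softmax output trained with a cross-entropy loss (the combination that makes the output-layer gradient collapse to y-hat - y), they could look like:

```python
import numpy as np

def loss_derivative(y, y_hat):
    # For a softmax output with cross-entropy loss, the gradient of the
    # loss with respect to z3 is simply the prediction minus the truth.
    return y_hat - y

def tanh_derivative(a):
    # The tanh derivative written in terms of the activation itself:
    # d/dz tanh(z) = 1 - tanh(z)**2 = 1 - a**2.
    return 1 - np.power(a, 2)
```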

This is how this process looks in Python:

```python
# This is the backward propagation function
def backward_prop(model, cache, y):
    # Load parameters from model
    W1, b1, W2, b2, W3, b3 = model['W1'], model['b1'], model['W2'], model['b2'], model['W3'], model['b3']

    # Load forward propagation results
    a0, a1, a2, a3 = cache['a0'], cache['a1'], cache['a2'], cache['a3']

    # Get number of samples
    m = y.shape[0]

    # Calculate loss derivative with respect to the output
    dz3 = loss_derivative(y=y, y_hat=a3)

    # Calculate loss derivative with respect to third layer weights
    dW3 = 1/m * (a2.T).dot(dz3)
    # Calculate loss derivative with respect to third layer bias
    db3 = 1/m * np.sum(dz3, axis=0)

    # Calculate loss derivative with respect to second layer
    dz2 = np.multiply(dz3.dot(W3.T), tanh_derivative(a2))
    # Calculate loss derivative with respect to second layer weights
    dW2 = 1/m * np.dot(a1.T, dz2)
    # Calculate loss derivative with respect to second layer bias
    db2 = 1/m * np.sum(dz2, axis=0)

    # Calculate loss derivative with respect to first layer
    dz1 = np.multiply(dz2.dot(W2.T), tanh_derivative(a1))
    dW1 = 1/m * np.dot(a0.T, dz1)
    db1 = 1/m * np.sum(dz1, axis=0)

    # Store gradients
    grads = {'dW3': dW3, 'db3': db3, 'dW2': dW2, 'db2': db2, 'dW1': dW1, 'db1': db1}
    return grads
```

#### Step 5- The Training Phase

In order to reach the optimal weights and biases that will give us the desired output, the 3 wine cultivars, we will have to train our neural network. I think this is very intuitive. For almost anything in life, you have to train and practice many times before you are good at it. Likewise, a Neural Network will have to undergo many epochs or iterations to give us an accurate prediction.

When you are learning anything – let's say you are reading a book – you do so at a certain pace. This pace should not be too slow, or reading the book will take ages, but it should not be too fast either, or you might miss a very valuable lesson in the book.

In the same way, you have to specify a 'learning rate' for the model. The learning rate is the multiplier used to update the parameters; it determines how rapidly they can change. If the learning rate is low, training will take longer. However, if the learning rate is too high, we might step over a minimum. The update rule is expressed as W := W - a · dL(W)/dW, where:

• := means that this is a definition (an assignment), not an equation or proven statement.
• a is the learning rate, called alpha.
• dL(W)/dW is the derivative of the total loss with respect to our weight W.

We chose a learning rate of 0.07 after some experimenting.
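The `train` function called below is not shown, but each epoch presumably ends with a gradient-descent update of every parameter, applying the rule above. A sketch of that step – the helper name and the toy 2×2 model are assumptions:

```python
import numpy as np

def update_parameters(model, grads, learning_rate=0.07):
    # Gradient-descent step: W := W - a * dW for every weight matrix
    # and bias vector in the model.
    for key in ('W1', 'b1', 'W2', 'b2', 'W3', 'b3'):
        model[key] = model[key] - learning_rate * grads['d' + key]
    return model

# Tiny illustration with 2x2 "layers" and all-ones gradients:
model = {'W1': np.ones((2, 2)), 'b1': np.zeros((1, 2)),
         'W2': np.ones((2, 2)), 'b2': np.zeros((1, 2)),
         'W3': np.ones((2, 2)), 'b3': np.zeros((1, 2))}
grads = {'d' + k: np.ones_like(v) for k, v in model.items()}
model = update_parameters(model, grads)
print(model['W1'][0, 0])  # 0.93 = 1 - 0.07 * 1
```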

```python
# This is what we return at the end
model = initialise_parameters(nn_input_dim=13, nn_hdim=5, nn_output_dim=3)
model = train(model, X, y, learning_rate=0.07, epochs=4500, print_loss=True)
plt.plot(losses)
```

Finally, there is our graph. You can plot your accuracy and/or loss to get a nice graph of your prediction accuracy. After 4500 epochs, our algorithm reaches an accuracy of about 99.44 %.

#### Brief Summary

We start by feeding data into the Neural Network and perform several matrix operations on this input data, layer by layer. For each of our 3 layers, we take the dot product of the input and the weights and add a bias. Next, we pass this output through an activation function of choice.

The output of this activation function is then used as an input for the following layer to follow the same procedure. This process is iterated 3 times since we have 3 layers. Our final output is y-hat, which is the prediction on which wine belongs to which cultivar. This is the end of the forward propagation process.

We then calculate the difference between our prediction (y-hat) and the expected output (y) and use this error value during backpropagation.
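One common way to quantify that difference for a softmax classifier is the cross-entropy loss: the average negative log-probability the network assigned to the true cultivar. The exact loss function used in the post is not shown, so this is a minimal sketch under that assumption:

```python
import numpy as np

def cross_entropy_loss(y, y_hat, eps=1e-12):
    # Average negative log-likelihood of the true class; a perfect
    # prediction (probability 1 on the right cultivar) gives loss ~0.
    m = y.shape[0]
    return -np.sum(y * np.log(y_hat + eps)) / m

y    = np.array([[1.0, 0.0, 0.0]])    # true cultivar: class 0
good = np.array([[0.9, 0.05, 0.05]])  # confident and correct
bad  = np.array([[0.1, 0.45, 0.45]])  # mostly wrong
print(cross_entropy_loss(y, good) < cross_entropy_loss(y, bad))  # True
```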

During backpropagation, we take our error – the difference between our prediction y-hat and y – and mathematically push it back through the NN in the other direction. We are learning from our mistakes. By taking the derivatives of the functions we used during forward propagation, we try to discover what values we should give the weights in order to achieve the best possible prediction. Essentially, we want to know: what is the relationship between the value of a weight and the error that we get out as the result?

And after many epochs or iterations, the NN has learned to give us more accurate predictions by adapting its parameters to our dataset. 