Original article was published by Rashida Nasrin Sucky on Artificial Intelligence on Medium

# Build A Neural Network From Scratch

## Build a complete Neural Network using a real dataset

Neural networks were developed to mimic the human brain. Though we are not there yet, neural networks are very effective in machine learning. They were popular in the 1980s and 1990s and have recently become popular again, as computers are now fast enough to run a large neural network in a reasonable time. In this article, I will discuss how to implement a neural network.

**I recommend that you read the following sections on the ideas behind neural networks carefully. If they are not entirely clear to you, do not worry. Move on to the implementation part, where I break everything down into even smaller pieces.**

# How A Neural Network Works

In a simple neural network, neurons are the basic computation units. They take the input features and channel them out as output. Here is what a basic neural network looks like:

Here, ‘layer1’ holds the input features. Layer1 feeds into the next layer, layer2, and the network finally outputs the predicted class or hypothesis in layer3. Layer2 is the hidden layer. You can use more than one hidden layer.

You have to design your neural network based on your dataset and accuracy requirements.

# Forward Propagation

The process of moving from layer1 to layer3 is called forward propagation. The steps in forward propagation are:

1. Initialize the coefficients theta for each input feature. Suppose there are 10 input features and 100 training examples, that is, 100 rows of data. In that case, the size of the input matrix is 100 x 10. Now determine the size of theta1. The number of rows needs to be the same as the number of input features, which is 10 in this example. The number of columns should be the size of the hidden layer, which is your choice.
2. Multiply the input features X by the corresponding thetas and then add a bias term. Pass the result through an activation function.

There are several activation functions available, such as sigmoid, tanh, relu, softmax, and swish.

I will use a sigmoid activation function to demonstrate the neural network. The hidden layer is then computed as:

$$a = g(X\theta_1 + b)$$

Here, ‘a’ represents the hidden layer or layer2 and b is the bias.

g(z) is the sigmoid activation:

$$g(z) = \frac{1}{1 + e^{-z}}$$

3. Initialize theta2 for the hidden layer. The size will be the length of the hidden layer by the number of output classes. In this example, the next layer is the output layer as we do not have any more hidden layers.

4. Then follow the same process as before. Multiply the hidden layer by theta2, add a bias term, and pass the result through the sigmoid activation to get the hypothesis or predicted output.
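The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions that do not come from the article's dataset: 100 random examples with 10 features (the sizes used in the description above), plus a hidden layer of 4 units and 3 output classes that are chosen arbitrarily. The names `forward`, `theta1`, `theta2`, `b1` and `b2` are placeholders for this sketch.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, theta1, b1, theta2, b2):
    """One forward pass through a network with a single hidden layer."""
    hidden = sigmoid(X @ theta1 + b1)        # layer2, shape (m, hidden_size)
    output = sigmoid(hidden @ theta2 + b2)   # layer3, shape (m, n_classes)
    return hidden, output

rng = np.random.default_rng(0)
m, n_features, hidden_size, n_classes = 100, 10, 4, 3   # assumed toy sizes

X = rng.random((m, n_features))                           # 100 x 10 input matrix
theta1 = rng.standard_normal((n_features, hidden_size))   # rows = input features
b1 = rng.standard_normal(hidden_size)
theta2 = rng.standard_normal((hidden_size, n_classes))    # hidden size x classes
b2 = rng.standard_normal(n_classes)

hidden, output = forward(X, theta1, b1, theta2, b2)
print(hidden.shape, output.shape)   # (100, 4) (100, 3)
```

Each row of `output` holds one value between 0 and 1 per class, which is the hypothesis for that training example.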

# Backpropagation

Backpropagation is the process of moving from the output layer to layer2. In this process, we calculate the error.

1. First, subtract the hypothesis from the original output y. That will be our delta3.

2. Now, calculate delta2, the error for the hidden layer. Multiply delta3 by theta2. Multiply the result elementwise by ‘a2’ times ‘1 - a2’. **In the formula below, the superscript 2 on ‘a’ represents layer2. Please do not mistake it for a square.**

3. Calculate the unregularized version of the gradient by dividing delta by the number of training examples m.
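Written out, the three steps above look roughly as follows. The notation is a reconstruction from the text, not the article's original figure: $h$ is the hypothesis, $\circ$ denotes elementwise multiplication, and the transpose on $\theta_2$ assumes the shape described in the forward propagation section (hidden layer size by number of output classes).

$$
\begin{aligned}
\delta^{(3)} &= y - h \\
\delta^{(2)} &= \left(\delta^{(3)}\,\theta_2^{\top}\right) \circ a^{(2)} \circ \left(1 - a^{(2)}\right) \\
\text{gradient for } \theta_2 &= \frac{1}{m}\left(a^{(2)}\right)^{\top}\delta^{(3)}, \qquad
\text{gradient for } \theta_1 = \frac{1}{m}\,X^{\top}\delta^{(2)}
\end{aligned}
$$

Because delta3 is defined here as $y - h$ rather than $h - y$, these quantities point in the direction that lowers the cost.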

# Train The Network

Update theta. Multiply the input features by delta2 and by a learning rate, and use the result to adjust theta1. Please pay attention to the dimensions of theta.

Repeat the process of forward propagation and backpropagation and keep updating the parameters until you reach an optimum cost. Here is the formula for the cost function. As a reminder, the cost function indicates how far the prediction is from the original output variable.

If you notice, this cost function formula is almost the same as the logistic regression cost function.
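For reference, the cost and the parameter update can be written as below. This is a reconstruction that follows the `cost_function` code later in the article rather than the original figure. Here $K$ is the number of output classes, $h$ the hypothesis, $\alpha$ the learning rate, and the deltas are the ones defined in the backpropagation section. The plus sign in the update follows from defining delta3 as $y - h$.

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[-y_k^{(i)}\log h_k^{(i)} - \left(1 - y_k^{(i)}\right)\log\left(1 - h_k^{(i)}\right)\right]$$

$$\theta_1 := \theta_1 + \frac{\alpha}{m}\,X^{\top}\delta^{(2)}, \qquad \theta_2 := \theta_2 + \frac{\alpha}{m}\left(a^{(2)}\right)^{\top}\delta^{(3)}$$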

# Implementation of a Neural Network

I am going to use a dataset from Andrew Ng’s Machine Learning course on Coursera. Here is the implementation of a neural network step by step. **I encourage you to run each line of code for yourself and print the output to understand it better.**

**1. Import the necessary packages and the dataset**

```python
import pandas as pd
import numpy as np

xls = pd.ExcelFile('ex3d1.xlsx')
df = pd.read_excel(xls, 'X', header=None)
```

These are the top five rows of the dataset. The values are the pixel values of the digits. Please feel free to download the dataset and follow along:

In this dataset, input and output variables are organized in separate Excel sheets. Let’s import the output variables into the notebook:

`y = pd.read_excel(xls, 'y', header=None)`

Again, these are only the top five rows. The output variables are the labels 1 to 10 (in this dataset, the label 10 stands for the digit 0). The goal of this project is to predict the digits using the input variables stored in ‘df’.

**2. Find the dimension of input and output variables**

```python
df.shape
y.shape
```

The shape of the input variables or df is 5000 x 400 and the shape of the output variables or y is 5000 x 1.

**3. Define the neural network**

For simplicity, we will use only one hidden layer of 25 neurons.

`hidden_layer = 25`

Find out the output classes.

```python
y_arr = y[0].unique()
# Output of y_arr:
# array([10, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int64)
```

As you can see above, there are 10 output classes.

**4. Initialize theta and bias**

We will randomly initialize theta for layer1 and layer2. Because we have three layers, there will be theta1 and theta2.

The shape of theta1: Size of layer1 x Size of layer2

The shape of theta2: Size of layer2 x Size of layer3

From step 2, the shape of ‘df’ is 5000 x 400. That means there are 400 input features. So, **the size of layer1 is 400**. As we specified the hidden layer size as 25, **the size of the layer2 is 25**. We have 10 output classes. So, **the size of layer3 is 10**.

The shape of theta1: 400 x 25

The shape of theta2: 25 x 10

In the same way, there will be two randomly initialized bias terms b1 and b2.

The shape of b1: the size of layer2 (in this case 25)

The shape of b2: the size of layer3 (in this case 10)

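Define a function for randomly initializing theta. Note that it returns an array of shape Lout x Lin, the transpose of the shapes listed above, which is why the forward propagation code multiplies by `theta.T`:

```python
def randInitializeWeights(Lin, Lout):
    # Uniform range [-epi, epi] with epi = sqrt(6) / sqrt(Lin + Lout)
    epi = (6**0.5) / (Lin + Lout)**0.5
    w = np.random.rand(Lout, Lin) * (2*epi) - epi
    return w
```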
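A quick way to check an initializer like this is to confirm that every weight falls inside the intended range. The sketch below uses a hypothetical helper `rand_init_weights` that takes an explicit random generator. It also shows why the square root has to be written as `6 ** 0.5`: in Python, `**` binds tighter than `/`, so `6**1/2` means `(6**1)/2`.

```python
import numpy as np

def rand_init_weights(l_in, l_out, rng):
    """Uniform weights in [-eps, eps) with eps = sqrt(6) / sqrt(l_in + l_out)."""
    eps = 6 ** 0.5 / (l_in + l_out) ** 0.5
    return rng.random((l_out, l_in)) * 2 * eps - eps

# Operator precedence: ** is evaluated before /.
print(6 ** 1 / 2)   # 3.0
print(6 ** 0.5)     # about 2.449

rng = np.random.default_rng(0)
w = rand_init_weights(400, 25, rng)
eps = 6 ** 0.5 / (400 + 25) ** 0.5

print(w.shape)                        # (25, 400)
print(bool(np.abs(w).max() <= eps))   # True
```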

Use this function to initialize theta:

```python
hidden_layer = 25
output = 10

theta1 = randInitializeWeights(len(df.T), hidden_layer)
theta2 = randInitializeWeights(hidden_layer, output)
theta = [theta1, theta2]
```

Now, initialize the bias terms as we discussed above:

```python
b1 = np.random.randn(25,)
b2 = np.random.randn(10,)
```

**5. Implement Forward propagation**

Use the formulas in the forward propagation section.

Define a function to multiply X and theta for convenience:

```python
def z_calc(X, theta):
    return np.dot(X, theta.T)
```

We will use the activation function several times, so it is good to have a function for the sigmoid activation as well:

```python
def sigmoid(z):
    return 1/(1 + np.exp(-z))
```

I will demonstrate the forward propagation step by step now. First, calculate the z term:

`z1 =z_calc(df, theta1) + b1`

Now pass this z1 through the activation function to get our hidden layer:

`a1 = sigmoid(z1)`

a1 is the hidden layer. The shape of a1 is 5000 x 25. Repeat the same process to calculate layer3, the output layer:

```python
z2 = z_calc(a1, theta2) + b2
a2 = sigmoid(z2)
```

The shape of a2 is 5000 x 10, with one column for each of the 10 classes. a2 is our layer3, that is, the final output or hypothesis. If there were more hidden layers in this example, the same process would be repeated to move from each layer to the next. This process of calculating the output layer from the input features is called forward propagation. Putting it all together in a function lets us perform forward propagation for any number of layers:

```python
l = 3  # the number of layers
b = [b1, b2]

def hypothesis(df, theta):
    a = []
    z = []
    for i in range(0, l-1):
        z1 = z_calc(df, theta[i]) + b[i]
        out = sigmoid(z1)
        a.append(out)
        z.append(z1)
        df = out  # the output of this layer is the input of the next
    return out, a, z
```

**6. Implement Backpropagation**

This is the process of going backward to calculate the gradient and update theta. Before that, we need to modify ‘y’. We have 10 classes in ‘y’, and each class needs its own column. For example, the first column is for class 10: it holds 1 where the label is 10 and 0 for all other classes. In this way we make an individual column for each class.

```python
y1 = np.zeros([len(df), len(y_arr)])
y1 = pd.DataFrame(y1)

for i in range(0, len(y_arr)):
    for j in range(0, len(y1)):
        if y[0][j] == y_arr[i]:
            y1.iloc[j, i] = 1
        else:
            y1.iloc[j, i] = 0

y1.head()
```
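The same one-column-per-class encoding can also be built without explicit loops by letting NumPy compare every label with every class at once. The sketch below uses a handful of made-up labels instead of the real dataset, and the names `labels`, `classes` and `one_hot` are placeholders.

```python
import numpy as np

labels = np.array([10, 1, 2, 10, 3])   # toy labels, not the real dataset
classes = np.array([10, 1, 2, 3])      # column order, like y_arr in the article

# Broadcasting compares each label (rows) with each class (columns).
one_hot = (labels[:, None] == classes[None, :]).astype(float)
print(one_hot)
```

With the article's variables this would be `(y[0].to_numpy()[:, None] == y_arr[None, :]).astype(float)`. The result is a NumPy array, so wrap it in `pd.DataFrame(...)` if the later `.iloc` calls should keep working.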

Just as I demonstrated forward propagation step by step and then put it all in a function, I will do the same for backpropagation. Using the steps from the backpropagation section above, calculate delta3 first. We will use z1, z2, a1, and a2 from the forward propagation implementation. Note that in the code a1 is the hidden layer and a2 is the output layer.

`del3 = y1-a2`

Now calculate delta2, following step 2 of the backpropagation section:

`del2 = np.dot(del3, theta2) * a1*(1 - a1)`

Here we need to learn a new concept: the sigmoid gradient. The formula for the sigmoid gradient is:

$$g'(z) = g(z)\,\left(1 - g(z)\right)$$

If you notice, this is exactly the same as **a(1 - a)** in the formula for delta, because a is sigmoid(z). By convention, I will use this sigmoid gradient in place of the a(1 - a) term in the formula for delta2 when I put everything together in a function. **They are exactly the same. I just wanted to demonstrate both.** Let’s write a function for the sigmoid gradient:

```python
def sigmoid_grad(z):
    return sigmoid(z)*(1 - sigmoid(z))
```
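The equivalence between the sigmoid gradient and a(1 - a) can be checked numerically. The sketch below compares both against a central finite difference. The sample points and the step size `h` are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-5.0, 5.0, 11)
a = sigmoid(z)

# Central finite difference approximates the derivative of sigmoid at z.
h = 1e-5
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

same_as_a_form = np.allclose(sigmoid_grad(z), a * (1 - a))
matches_numeric = np.allclose(sigmoid_grad(z), numeric, atol=1e-8)
print(same_as_a_form, matches_numeric)   # True True
```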

Finally, it is time to update theta. Each theta is adjusted by the learning rate times its gradient, that is, the product of the corresponding delta and the activations feeding into that layer, divided by the number of training examples.

We need to choose a learning rate. I chose 0.003. I encourage you to try other learning rates to see how it performs:

```python
# delta is y - a, so the update is added to the current theta
theta1 = theta1 + 0.003 * np.dot(del2.T, df) / len(df)
theta2 = theta2 + 0.003 * np.dot(del3.T, a1) / len(df)
```

This is how theta needs to be updated. This process is called backpropagation because it moves backward. Before writing the function for backpropagation, we need to define the cost function, because I will include the calculation of the cost in the backpropagation method as well. It could also be added to the forward propagation, or you can keep it separate while training the network. Here is the method for the cost function:

```python
def cost_function(y, y_calc, l):
    # l is not used; m is the global number of training examples
    return (np.sum(np.sum(-np.log(y_calc)*y - np.log(1-y_calc)*(1-y))))/m
```
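To get a feel for the numbers this cost produces, here is a small worked example on two training examples and two classes. The function `cross_entropy_cost` is a hypothetical standalone variant that takes `m` from the input shape. The clipping with `eps` is an added safeguard against `log(0)` and is not part of the article's function.

```python
import numpy as np

def cross_entropy_cost(y, y_calc, eps=1e-12):
    """Cross-entropy summed over classes and averaged over rows."""
    m = y.shape[0]
    y_calc = np.clip(y_calc, eps, 1 - eps)   # assumed safeguard against log(0)
    return np.sum(-y * np.log(y_calc) - (1 - y) * np.log(1 - y_calc)) / m

y = np.array([[1.0, 0.0],
              [0.0, 1.0]])
uninformed = np.full((2, 2), 0.5)            # every output is 0.5
confident = np.array([[0.99, 0.01],
                      [0.01, 0.99]])

cost_uninformed = cross_entropy_cost(y, uninformed)
cost_confident = cross_entropy_cost(y, confident)
print(cost_uninformed)   # 2 * ln(2), about 1.386
print(cost_confident)    # about 0.0201
```

The closer the predictions are to the true labels, the closer the cost gets to zero.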

Here m is the number of training examples. Putting it all together:

```python
m = len(df)

def backpropagation(df, theta, y1, alpha):
    out, a, z = hypothesis(df, theta)

    # Error at the output layer
    delta = [y1 - a[-1]]

    # Propagate the error backward through the hidden layers
    for i in range(len(theta) - 1, 0, -1):
        delta.append(np.dot(delta[-1], theta[i]) * sigmoid_grad(z[i-1]))

    # delta is y - a, so the gradient step is added to the current theta
    # instead of overwriting it
    theta[0] = theta[0] + alpha * np.dot(delta[-1].T, df) / m
    for i in range(1, len(theta)):
        theta[i] = theta[i] + alpha * np.dot(delta[-(i+1)].T, a[i-1]) / m

    out, a, z = hypothesis(df, theta)
    cost = cost_function(y1, a[-1], 1)
    return theta, cost
```
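As a self-contained cross-check, the sketch below runs a standard gradient-descent variant of this training step on synthetic data. It makes several assumptions that are not from the article: each step is added to the existing parameters, the gradient is averaged over the `m` examples, the biases are updated as well, and the data, layer sizes and learning rate are arbitrary. The name `train_step` is made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(y, h):
    return np.sum(-y * np.log(h) - (1 - y) * np.log(1 - h)) / len(y)

def train_step(X, y, params, alpha):
    """One forward and backward pass; updates the arrays in params in place."""
    theta1, b1, theta2, b2 = params
    m = len(X)
    a1 = sigmoid(X @ theta1.T + b1)            # hidden layer
    a2 = sigmoid(a1 @ theta2.T + b2)           # output layer
    del3 = y - a2                              # output error
    del2 = (del3 @ theta2) * a1 * (1 - a1)     # hidden error
    # delta is y - a, so adding these terms moves down the cost gradient.
    theta2 += alpha * del3.T @ a1 / m
    b2 += alpha * del3.sum(axis=0) / m
    theta1 += alpha * del2.T @ X / m
    b1 += alpha * del2.sum(axis=0) / m
    return cost(y, a2)                         # cost before this update

rng = np.random.default_rng(0)
X = rng.random((200, 6))                        # synthetic inputs
y = np.eye(3)[rng.integers(0, 3, size=200)]     # synthetic one-hot targets
params = [rng.uniform(-0.5, 0.5, (4, 6)), np.zeros(4),
          rng.uniform(-0.5, 0.5, (3, 4)), np.zeros(3)]

costs = [train_step(X, y, params, alpha=0.3) for _ in range(200)]
print(costs[0], costs[-1])
```

Tracking the cost like this is a simple sanity check: with a small enough learning rate, it should go down from one epoch to the next.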

**7. Train the Network**

I will train the network for 20 epochs. I initialize theta again in this code snippet because I already used and updated it above. Without re-initializing, training would start from the already updated theta, and I want a fresh start.

```python
theta1 = randInitializeWeights(len(df.T), hidden_layer)
theta2 = randInitializeWeights(hidden_layer, output)
theta = [theta1, theta2]

cost_list = []
for i in range(20):
    theta, cost = backpropagation(df, theta, y1, 0.003)
    cost_list.append(cost)

cost_list
```

I used a learning rate of 0.003 and ran it for 20 epochs. Please look at the GitHub link provided below: I tried different learning rates and different numbers of epochs before arriving at these values.

We got the list of costs that we calculated in each epoch and also the final updated theta. Use this final theta to predict the output.

**8. Predict the output and calculate the accuracy**

Simply pass this updated theta to the hypothesis function to predict the output:

`out, a, z = hypothesis(df, theta)`

Now calculate the accuracy:

```python
accuracy = 0
for i in range(0, len(out)):
    for j in range(0, len(out[i])):
        if out[i][j] >= 0.5 and y1.iloc[i, j] == 1:
            accuracy += 1

accuracy/len(df)
```

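In my run, the accuracy came out as 100%. Perfect, right? Keep in mind what this check measures: an example counts as correct whenever the output for its true class is at least 0.5, regardless of the outputs for the other classes, so the number can overstate how well the network classifies. We also do not get 100% accuracy all the time. Sometimes getting a 70% accuracy is great, depending on the dataset.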
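A stricter way to score the predictions is to pick the single highest-scoring class for each example and compare it with the true class. The sketch below is an alternative check, not the article's method. `argmax_accuracy` and the toy arrays are made up for illustration. For comparison, it also computes the threshold-style score that the loop above uses.

```python
import numpy as np

def argmax_accuracy(out, y_one_hot):
    """Share of rows where the highest-scoring class is the true class."""
    predicted = np.argmax(out, axis=1)
    actual = np.argmax(y_one_hot, axis=1)
    return np.mean(predicted == actual)

# Toy outputs for 4 examples and 3 classes (not from the trained network).
out = np.array([[0.9, 0.2, 0.1],
                [0.6, 0.7, 0.1],
                [0.8, 0.6, 0.9],
                [0.7, 0.6, 0.2]])
y_one_hot = np.array([[1, 0, 0],
                      [0, 1, 0],
                      [0, 0, 1],
                      [0, 1, 0]])

strict = argmax_accuracy(out, y_one_hot)
# Threshold-style score: is the true-class output at least 0.5?
threshold = np.mean(out[y_one_hot == 1] >= 0.5)
print(strict, threshold)   # 0.75 1.0
```

In the last row another class scores higher than the true one, so the strict check counts it as a miss while the threshold check does not. With the article's variables, the call would be `argmax_accuracy(out, y1.to_numpy())`.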

Congrats! You just developed a complete neural network! This same problem is solved using a logistic regression algorithm in this article:

Here is the GitHub link for the full working code:
