Deep Learning 101 — Building a Neural Network from the Ground Up — DataHubbs

In the last post, we walked through the theory behind deep learning and introduced key concepts like backpropagation and gradient descent. Here, we take a more hands-on approach and show how this can all be implemented using Python and Numpy. I’ll point out where the theory comes in as we build a simple neural network architecture for prediction, so familiarity with the concepts discussed previously will be helpful. As I have mentioned before on this blog, having some understanding of the guts of an algorithm (even if your understanding is not perfect) greatly helps when working in data science and machine learning. You can typically see where a given algorithm might perform better or worse, troubleshoot more effectively, and grasp its limitations and potential more readily. Familiarizing yourself with a low-level implementation like this one also helps when you move to higher-level frameworks like TensorFlow or Keras, because you’ll have a sense of what is going on under the hood.

Note: Medium still doesn’t render mathematical equations correctly, so if you want to see the details, check out my post here.

TL;DR

We build a neural network from scratch using nothing but Python and the Numpy package. I walk through the architecture step by step and explicitly call out the what, why, and how.

Getting Started

If you’re following along at home, fire up your favorite Python IDE and get going! We’ll build a network line by line from scratch, train it on some fake data we generate using scikit-learn, and then build a few helper functions around it to make our predictions.
Without further ado, let’s get to importing our packages.

import numpy as np 
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

We’re going to need some data to train on. Thankfully sklearn has a number of different data sets to work with.

# Generate moon-shaped data
np.random.seed(0)
X, Y = make_moons(500, noise=0.1)
# Split into test and training data
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=73)
# Plot results
plt.figure(figsize=(12,8))
plt.scatter(X_train[:,0], X_train[:,1], c=Y_train,
            cmap=plt.cm.cividis, s=50)
plt.xlabel('X1')
plt.ylabel('X2')
plt.title('Random Training Data')
plt.show()

We’ve split our data into test and training groups; now we need to get the dimensions of our data as parameters for setting up our network. Proper dimensions are critical for the numerous linear algebra operations we have to execute.

n_inputs = X_train.shape[0]       # Number of training examples
n_inputs_dim = X_train.shape[1]   # Number of input features

Some of the features of our network are set by our data, but others are open for design. We’re going to make a small, 2-layer neural network (one hidden layer, one output layer). We can set the number of nodes in the hidden layer however we like (we’ll play with this parameter later). Additionally, because our moon-shaped data is a binary classification problem, let’s set the number of output nodes to 1. We’ll put a sigmoid activation function on the output layer, which will allow us to interpret the output probabilistically.

# Layer size
n_h = 4      # Number of hidden nodes
n_out = 1    # Number of output nodes (1 for a binary classifier)

With our network dimensions set, we need to go ahead and initialize our parameters. The bias terms can be initialized to 0, but we can’t do the same with the weights; we need to initialize them randomly to break the symmetry of the values. Think about it like this: if all the weights were initialized to 0, then no signal would propagate forward through the network, and if all the weights were the same (e.g. 1), the same signal magnitude would be propagated forward everywhere. For our initialization, we’ll use the Xavier initialization formula, which is designed to break symmetry and lead to faster training. The version I’m using follows the implementation found in the mxnet documentation and is given as:

c = \sqrt{\frac{3}{\frac{1}{2}(n_{in} + n_{out})}}

The weights are then selected from a uniform distribution from [-c, c].

c = np.sqrt(3 / (0.5 * (n_inputs_dim + n_out)))
# Initialize weights and bias
W1 = np.random.uniform(low=-c, high=c, size=(n_inputs_dim, n_h))
b1 = np.zeros((1, n_h))
W2 = np.random.uniform(low=-c, high=c, size=(n_h, n_out))
b2 = np.zeros((1, n_out))

Activation Function

For our activation function, we’ll select exponential linear units, or ELUs, to train with. These are simple functions that have been found to offer better classification performance (i.e. faster and more accurate training) than some other activation functions.

ELU(x) = \begin{cases} x, & \text{if } x > 0 \\ a \big(e^x - 1 \big), & \text{if } x \leq 0 \end{cases}

The value a is another hyperparameter here and must be greater than 0 (a > 0). For our purposes, we’ll use a = 2 as the default, but feel free to play with it to see if you can improve the results.

# Activation function: Exponential Linear Unit (ELU)
def Elu(x, a=2):
    """Compute the ELU output of x."""
    return np.where(x <= 0, a * (np.exp(x) - 1), x)

For our output layer, we’ll use the sigmoid or logistic function to classify our results probabilistically.

\sigma(x) = \frac{1}{1 + e^{-x}}

def sigmoid(x):
    """Compute the sigmoid of array x."""
    return 1 / (1 + np.exp(-x))

Forward Propagation

Now that we have our network dimensions defined, have selected activation functions, and initialized our weights and biases, we can proceed with the forward propagation step. This is simply a series of linear transformations followed by the application of the activation function until we reach the output layer.

Z1 = np.dot(X_train, W1) + b1 
A1 = Elu(Z1)
Z2 = np.dot(A1, W2) + b2
A2 = sigmoid(Z2)
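
Most implementation bugs in a from-scratch network show up as dimension mismatches, so it can be worth sanity-checking the intermediate shapes at this point. This quick check is my own addition rather than part of the original walkthrough; with 375 training rows (75% of our 500 points), 4 hidden nodes, and 1 output node, the arrays should look like this:

# Optional sanity check on the forward propagation shapes
assert Z1.shape == (n_inputs, n_h)    # (375, 4)
assert A1.shape == (n_inputs, n_h)    # (375, 4)
assert Z2.shape == (n_inputs, n_out)  # (375, 1)
assert A2.shape == (n_inputs, n_out)  # (375, 1)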

Our value A2 is our network’s prediction: the probability the network assigns to each data point belonging to class 1. To see how well it is performing, we can calculate the cross-entropy loss, which takes the log probabilities and compares them to the actual data labels (y).

The cross-entropy loss formula is given as:

L = -\frac{1}{m} \sum^m_{i=1} \big(y^{(i)}log(A_2^{(i)}) + (1 - y^{(i)})log(1 - A_2^{(i)}) \big)

Y_train = Y_train.reshape(-1,1)
log_probs = (np.multiply(np.log(A2), Y_train) +
             np.multiply(np.log(1 - A2), (1 - Y_train)))
loss = -1 / n_inputs * np.sum(log_probs)

Checking the loss value (for reference, a network that assigned a probability of 0.5 to every point would have a loss of log(2) ≈ 0.69, so our randomly initialized network is starting roughly where we’d expect):

loss
0.6395828151201088

Back Propagation

In order to implement gradient descent with back propagation, we need to roll up our sleeves and do a bit of calculus. In the previous post, I walked through the details, so feel free to go back to that if you need a refresher or want to see the derivation. Here, we’re just going to take those results and apply them to our model.

We need the derivative for all of the weights and biases in our network. This is done by working backwards from the loss function and propagating the results throughout the network. Writing the equations for our network, we have:

  • First Layer: Z_1 = W_1 \cdot X + b_1
  • First Layer Activation: A_1 = Elu(Z_1)
  • Output Layer: Z_2 = W_2 \cdot A_1 + b_2
  • Output Layer Activation: A_2 = \sigma(Z_2)
  • Loss: L(A_2, y) = -\big(y log(A_2) + (1 - y) log(1 - A_2) \big)

To calculate our back propagation values, we need \frac{dL}{dA_2} which is:

\frac{\partial L}{\partial A_2} = -\frac{y}{A_2} + \frac{1-y}{1-A_2}
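
As a quick check on where the compact forms below come from, note that the derivative of the sigmoid is \sigma'(Z_2) = A_2(1 - A_2). Applying the chain rule then gives:

\frac{\partial L}{\partial Z_2} = \frac{\partial L}{\partial A_2} \cdot \frac{\partial A_2}{\partial Z_2} = \Big(-\frac{y}{A_2} + \frac{1-y}{1-A_2}\Big) A_2 (1 - A_2) = A_2 - y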

Working backwards, the partial derivatives of each of our subsequent layers are:

  • Output Layer Activation Derivative (dZ2): \frac{\partial L}{\partial Z_2} = A_2 - y
  • Output Layer Derivatives (dW2; db2): \frac{\partial L}{\partial W_2} = (A_2 - y)A_1; \frac{\partial L}{\partial b_2} = A_2 - y
  • First Layer Activation Derivative (dZ1): \frac{\partial L}{\partial Z_1} = W_2 (A_2 - y) g_1'(Z_1)
  • First Layer Derivatives (dW1; db1): \frac{\partial L}{\partial W_1} = W_2 (A_2 - y) g_1'(Z_1) X; \frac{\partial L}{\partial b_1} = W_2 (A_2 - y) g_1'(Z_1)

Where g_1(Z_1) is our first-layer activation function, which we defined as the ELU. Thus, g_1'(Z_1) is:

g_1'(Z_1) = \frac{d}{dx}ELU(x) = \begin{cases} 1, & \text{if } x > 0 \\ a e^x, & \text{if } x \leq 0 \end{cases}

Notice that a number of values from our forward propagation pass are utilized here for the back propagation step, so we just need to be sure not to overwrite those values until we have applied our updates.

def dElu(x, a=2):
    """Compute the derivative of the ELU function."""
    # The derivative is 1 for x > 0 and a * e^x for x <= 0
    return np.where(x <= 0, a * np.exp(x), 1)

Additionally, in the code we’ll scale the gradients by dividing by the number of examples, which averages the updates over the training set and helps smooth things out during training. We’ll call this scaling factor m.

# Calculate back propagation
m = 1 / n_inputs
dZ2 = A2 - Y_train
dW2 = m * np.dot(A1.T, dZ2)
db2 = m * np.sum(dZ2, axis=0, keepdims=True)
# The 1/m averaging factor is applied once, in the weight and bias gradients
dZ1 = np.dot(dZ2, W2.T) * dElu(Z1)
dW1 = m * np.dot(X_train.T, dZ1)
db1 = m * np.sum(dZ1, axis=0, keepdims=True)

Once we have calculated all of the derivatives, we can update our parameters using gradient descent. This is done simply by changing the parameter in question by the product of the step-size parameter or learning rate (\alpha) and the derivative we just calculated. In general then, we have:

\theta := \theta - \alpha \frac{\partial L}{\partial \theta}

Where \theta are our network weights and biases.

learning_rate = 0.01 
W2 -= learning_rate * dW2
b2 -= learning_rate * db2
W1 -= learning_rate * dW1
b1 -= learning_rate * db1

After the update, we can run another forward propagation pass, check our loss, and update our weights again. Training is really just a repetition of the process we’ve just outlined, over and over, until we reach convergence or run through a pre-defined number of training steps.

Putting it all Together

Because it could take hundreds, thousands, or in some cases millions of iterations to train the network, we’ll take the code above, put it into a class with a number of training and testing methods to tie everything together compactly, and see how it performs on our data. You can get the code on GitHub here or copy it from my post here.
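
The class itself isn’t reproduced in this post, so here is a minimal sketch of what such a class might look like, reusing the forward and backward passes from above along with the method names (initialize, train, predict, train_accuracy, and plot_decision_boundary) used in the calls below. It assumes Numpy as np, matplotlib.pyplot as plt, and the Elu, sigmoid, and dElu functions defined earlier; the actual implementation on GitHub may differ in its details (the default number of training iterations below, for example, is my own choice).

# Minimal sketch of a network class (the GitHub implementation may differ)
class network:
    def __init__(self, X, Y):
        self.X = X
        self.Y = Y.reshape(-1, 1)
        self.n_inputs = X.shape[0]
        self.n_inputs_dim = X.shape[1]

    def initialize(self, n_hidden=4, n_out=1):
        # Xavier-style initialization, as above
        c = np.sqrt(3 / (0.5 * (self.n_inputs_dim + n_out)))
        self.W1 = np.random.uniform(-c, c, size=(self.n_inputs_dim, n_hidden))
        self.b1 = np.zeros((1, n_hidden))
        self.W2 = np.random.uniform(-c, c, size=(n_hidden, n_out))
        self.b2 = np.zeros((1, n_out))

    def _forward(self, X):
        # Forward propagation pass
        Z1 = np.dot(X, self.W1) + self.b1
        A1 = Elu(Z1)
        Z2 = np.dot(A1, self.W2) + self.b2
        A2 = sigmoid(Z2)
        return Z1, A1, Z2, A2

    def train(self, n_iter=10000, learning_rate=0.01, log_loss=True):
        # Repeat forward pass, back propagation, and gradient descent updates
        losses = []
        m = 1 / self.n_inputs
        for _ in range(n_iter):
            Z1, A1, Z2, A2 = self._forward(self.X)
            if log_loss:
                log_probs = (np.multiply(np.log(A2), self.Y) +
                             np.multiply(np.log(1 - A2), 1 - self.Y))
                losses.append(-m * np.sum(log_probs))
            dZ2 = A2 - self.Y
            dW2 = m * np.dot(A1.T, dZ2)
            db2 = m * np.sum(dZ2, axis=0, keepdims=True)
            dZ1 = np.dot(dZ2, self.W2.T) * dElu(Z1)
            dW1 = m * np.dot(self.X.T, dZ1)
            db1 = m * np.sum(dZ1, axis=0, keepdims=True)
            self.W2 -= learning_rate * dW2
            self.b2 -= learning_rate * db2
            self.W1 -= learning_rate * dW1
            self.b1 -= learning_rate * db1
        return losses

    def predict(self, X):
        # Classify a point as 1 if its predicted probability is >= 0.5
        _, _, _, A2 = self._forward(X)
        return (A2 >= 0.5).astype(int)

    def train_accuracy(self):
        return np.mean(self.predict(self.X) == self.Y)

    def plot_decision_boundary(self):
        # Evaluate the network on a grid of points and shade by predicted class
        x_min, x_max = self.X[:, 0].min() - 0.5, self.X[:, 0].max() + 0.5
        y_min, y_max = self.X[:, 1].min() - 0.5, self.X[:, 1].max() + 0.5
        xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),
                             np.arange(y_min, y_max, 0.01))
        preds = self.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
        plt.figure(figsize=(12, 8))
        plt.contourf(xx, yy, preds, cmap=plt.cm.cividis, alpha=0.3)
        plt.scatter(self.X[:, 0], self.X[:, 1], c=self.Y.ravel(),
                    cmap=plt.cm.cividis, s=50)
        plt.show()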

With our network class defined, we only need to pass it our training data, initialize the weights, and train it.

net = network(X_train, Y_train) 
net.initialize(n_hidden=3)
training_loss = net.train(learning_rate=0.1, log_loss=True)
plt.figure(figsize=(12,8))
plt.plot(training_loss)
plt.title("Network Training Loss")
plt.xlabel("Number of Training Iterations")
plt.ylabel("Cross-Entropy Loss")
plt.show()
print("Training Accuracy: %.2f" %net.train_accuracy())
Training Accuracy: 0.89

It seems to be working well! Let’s see how well it predicts our test data.

pred = net.predict(X_test)
np.sum(Y_test.reshape(-1,1)==pred) / len(Y_test)
0.84799999999999998

85% classification accuracy isn’t too bad, and it was (hopefully) fairly straightforward to follow along with my implementation. Another interesting thing to visualize is the decision boundary for our data. To do this, we can pass our network a grid of points and see where it places the cut-off between the two classes. If you noticed, I already built this method into the network class, so we can simply call it once we have trained our network to visualize the results.

net.plot_decision_boundary()

We can also see how hyperparameters such as the number of hidden nodes affect the results.

n_hidden = [2, 5, 10, 25, 50]
# Loop through hidden layer settings
for n in n_hidden:
    net = network(X, Y)
    net.initialize(n_hidden=n)
    net.train(learning_rate=0.1, log_loss=False)
    print("%d Hidden Nodes" %n)
    net.plot_decision_boundary()
2 Hidden Nodes
5 Hidden Nodes
10 Hidden Nodes
25 Hidden Nodes
50 Hidden Nodes

As you can see, the network gradually gets better and better at classifying our shapes. Typically, as we increase the number of hidden nodes or network layers, we run the risk of overfitting our data. There are methods to deal with this such as drop-out, but that will be for another post when we’re dealing with a much larger network.

There are a few things you can try on your own, such as adjusting the network’s activation function to see if you can get it to learn faster, or whether another choice leads to overfitting. Remember to work out the new derivative and apply the new function to both the forward and backward propagation passes. You could also adjust the loss function. Something like a softmax function on the output layer generalizes to multi-class problems and requires a different number of output nodes; it also introduces additional complexity, because the softmax cross-entropy loss has a different gradient to work through. Finally, we only have a two-layer neural network (not counting the input layer), which doesn’t really qualify as deep learning, although the precise definition is a bit nebulous. Try adding layers. It’s really the same procedure, just with more terms for the forward and backward propagation passes to take care of.
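
As a rough illustration of the softmax suggestion (this snippet is my own addition, not part of the original code), a softmax output activation for a multi-class version might look something like this:

# Softmax output activation for a multi-class extension (illustrative sketch)
def softmax(z):
    """Convert an (n_examples, n_classes) array of scores to probabilities."""
    # Subtract the row-wise max for numerical stability
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)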

That’s it! You’ve got a neural network from nothing but numpy! Hopefully you’re comfortable with the concepts and ready to go deeper into the topic and utilize frameworks such as TensorFlow to build these models. As always, leave some comments or let me know if there are any questions!