AI: Taking A Peek Under The Hood. Part 2, Creating a Two-Layer Neural Network the Old-Fashioned Way


Matthew Henry

Before diving into the new content, here is a quick recap of Part 1 to refresh continuing readers and give new readers context. As discussed there, AI is just math, and the mathematical algorithms powering the AI functions we see today are Neural Networks.

Neural Networks come in three general forms:

  1. Standard Neural Networks (generally used for binary classification and point-estimation or regression-type problems).
  2. Convolutional Neural Networks (generally used in image recognition).
  3. Recurrent Neural Networks (generally applied to problems involving sequence data such as text and speech recognition, understanding, or prediction).

Today we will construct a Standard Neural Network with two layers (one hidden layer and an output layer). The mathematics will be explained and Python code will be provided. This will equip you not only to understand exactly how a Standard Neural Network works, but also to build one on your own.

To begin, let’s take a quick look at the steps involved before unpacking each one. Note that we will not be covering data wrangling, cleaning, or transforming, as this varies widely between datasets and applications. That said, it is worth pointing out that standardizing your input data matters: it helps the network identify relationships and keeps the optimization numerically well-behaved, which reduces computing cost.
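As a minimal sketch of standardization (assuming, as in the rest of this article, that X is a NumPy array of shape (number of features, number of examples); the small epsilon is an added guard, not something from the original):

import numpy as np

def standardize(X):
    """
    Standardize each feature (row) of X to zero mean and unit variance.
    X has shape (number of features, number of examples), matching the
    layout used throughout this article.
    """
    mu = X.mean(axis=1, keepdims=True)           # per-feature mean
    sigma = X.std(axis=1, keepdims=True) + 1e-8  # per-feature std; epsilon avoids division by zero
    return (X - mu) / sigma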

The Steps:

  1. Define the model structure (such as number of input features)
  2. Initialize the model’s parameters
  3. Calculate the current loss (aka “forward propagation”)
  4. Calculate the current gradient (aka “backward propagation”)
  5. Update parameters (gradient descent)
  6. Predict on hold-out data

Steps 3–5 are performed in a loop whose length depends on the number of iterations we choose, which is one of the decisions we make when defining the model structure in Step 1.

Step 1: Define the Model Structure

Defining your model structure includes both the type of Neural Network you are going to build and the hyperparameters of that network. The hyperparameters are the number of layers, the number of nodes per layer, the activation function used in each layer, alpha (the learning rate, which matters for gradient descent because it determines how quickly we update our parameters), and the number of iterations. These hyperparameters influence our parameters “w” and “b” (weight and bias), which are calculated within each node of the network. The weight and bias are in turn inputs to the activation function applied in each node.
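To make this concrete, here is an illustrative summary of the hyperparameters used for the network built in this article (the dictionary is just a convenient way to present them, not part of the article’s code; the values come from the functions defined below):

# Hyperparameters of the two-layer network built in this article (illustrative summary only).
hyperparameters = {
    "n_h": 4,                        # neurons in the hidden layer
    "hidden_activation": "tanh",     # activation used in the hidden layer
    "output_activation": "sigmoid",  # activation used in the output layer (binary classification)
    "learning_rate": 1.2,            # alpha, used in the gradient descent update
    "num_iterations": 10000,         # number of forward/backward passes
}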

This part is both an art and a science. Data scientists with a lot of experience in a certain domain gain intuition about which set of hyperparameters works best for their business problems, but this intuition is generally not transferable across problems and domains, and some manual trial and error is always required. There are plenty of rules of thumb you can Google to give you a good idea of starting points, but they will not eliminate the need to try different combinations to come up with the best model for your business problem.

One thing to keep in mind is that, as discussed in Part 1 of this series, Deep Neural Networks tend to work better and can also reduce the overall number of nodes required in the network. To illustrate why a Deep Neural Network usually works better, take the example of facial recognition. In building a Neural Network to perform facial recognition you will want several layers to break the task into manageable chunks. The first hidden layer focuses on simple, small tasks such as detecting edges. The second hidden layer builds upon those edges to detect facial parts (such as a nose, eyes, or lips). The third layer puts those pieces together to identify an entire face. Finally, your output layer tells you whether it is the face you are looking for (as a probability). This is why sometimes your phone or computer recognizes you and sometimes it does not: the software is programmed to unlock your device if the algorithm determines with a certain probability that it is in fact you, and that probability changes depending on lighting, the angle at which you face your screen, whether you are wearing glasses, and so on.

That said, let’s start defining the structure of our Neural Network.

In the code below we will define 3 variables:

  • n_x: the size of the input layer
  • n_h: the size of the hidden layer (how many neurons are in the hidden layer — we will set this to be 4 neurons)
  • n_y: the size of the output layer
import numpy as np

def layer_sizes(X, Y):
    """
    Arguments:
    X -- input dataset of shape (input size, number of examples)
    Y -- labels of shape (output size, number of examples)

    Returns:
    n_x -- the size of the input layer
    n_h -- the size of the hidden layer
    n_y -- the size of the output layer
    """
    n_x = X.shape[0]  # size of the input layer (number of input features)
    n_h = 4           # number of neurons in the hidden layer
    n_y = Y.shape[0]  # size of the output layer

    return (n_x, n_h, n_y)

Step 2: Initialize Model Parameters

Initializing our model parameters means giving our parameters “w” and “b” starting values before we begin iterating through the process of calculating and optimizing them within each node to identify the true relationship between our input data and the predicted (output) variable.

The weights (“w”) will be initialized as very small random values, while the biases (“b”) will be initialized as zeros. We want the weights to be small so that they start out close to the center of the tanh or sigmoid function, which speeds up learning. If we initialized them as large values, we would start our optimization near the tails of those functions, where there is very little slope, and this would slow down gradient descent (optimization) later on. The weights need to be random so that they are not all the same; the nodes must compute different functions in order to be useful. It helps to compare the shapes of the sigmoid, tanh, ReLU, and Leaky ReLU functions to see why.

Near the tails of the tanh or sigmoid curves (which is where large initial values would land) there is almost no slope. Initializing with small values does not always eliminate this problem, which is one of the reasons the ReLU function can improve training time.
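For reference, the four activation functions mentioned above can be written in a few lines of NumPy (the 0.01 slope used for Leaky ReLU is a common illustrative choice, not something specified in this article):

import numpy as np

def sigmoid(z):
    # Squashes inputs into (0, 1); nearly flat (tiny gradient) for large |z|
    return 1 / (1 + np.exp(-z))

def tanh(z):
    # Squashes inputs into (-1, 1); also nearly flat for large |z|
    return np.tanh(z)

def relu(z):
    # Zero for negative inputs, identity for positive inputs
    return np.maximum(0, z)

def leaky_relu(z, slope=0.01):
    # Like ReLU, but with a small slope for negative inputs so the gradient is never exactly zero
    return np.where(z > 0, z, slope * z)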

Let’s take a look at the code:

def initialize_parameters(n_x, n_h, n_y):
    """
    Arguments:
    n_x -- size of the input layer
    n_h -- size of the hidden layer
    n_y -- size of the output layer

    Returns:
    params -- python dictionary containing your parameters:
        W1 -- weight matrix of shape (n_h, n_x)
        b1 -- bias vector of shape (n_h, 1)
        W2 -- weight matrix of shape (n_y, n_h)
        b2 -- bias vector of shape (n_y, 1)
    """
    np.random.seed(2)  # set up a seed so that your output is consistent

    W1 = np.random.randn(n_h, n_x) * 0.01  # small random values
    b1 = np.zeros((n_h, 1))                # zeros
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))

    assert (W1.shape == (n_h, n_x))
    assert (b1.shape == (n_h, 1))
    assert (W2.shape == (n_y, n_h))
    assert (b2.shape == (n_y, 1))

    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}

    return parameters

Note that the assert statements are there to ensure our weight and bias matrices have the correct dimensions according to the specifications of our Neural Network (the details of which are not covered in this article).

Step 3: Forward Propagation

Forward propagation is when we move through the network from the input layer to the output layer, calculating several important values within each node as we go. Within each node we perform the following calculations:

  • Z = w^T x + b (w transpose times x, plus b)
  • a = σ(Z), the activation value (using a sigmoid activation function in this example)

Rather than performing this step as an explicit loop in Python, it is important to vectorize it to speed up processing. To do so, we compute Z for the whole layer (and all training examples) at once by:

  1. Taking all of our w vectors and stacking them into a matrix W (each w is a column vector; we transpose each one so it becomes a row, and stacking these rows forms the matrix),
  2. then computing Z = WX + b, i.e. multiplying W by the input matrix X and adding the bias vector b, which is broadcast across the examples (see the sketch below).
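As a minimal sketch of the speed-up idea (the shapes follow this article’s conventions; the explicit loop is shown only for contrast and the specific numbers are arbitrary):

import numpy as np

# Shapes follow the article's convention: X is (n_x, m), W1 is (n_h, n_x), b1 is (n_h, 1).
np.random.seed(1)
n_x, n_h, m = 3, 4, 5
X = np.random.randn(n_x, m)
W1 = np.random.randn(n_h, n_x) * 0.01
b1 = np.zeros((n_h, 1))

# Loop version: one node and one example at a time (slow, shown only for contrast)
Z_loop = np.zeros((n_h, m))
for node in range(n_h):
    for example in range(m):
        Z_loop[node, example] = np.dot(W1[node, :], X[:, example]) + b1[node, 0]

# Vectorized version: the whole layer and all examples in one matrix operation
Z_vec = np.dot(W1, X) + b1  # b1 is broadcast across the m examples

assert np.allclose(Z_loop, Z_vec)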

In terms of selecting your activation function, this really depends on the problem you are trying to solve. In the hidden layers you may want a function that speeds up training (e.g. tanh or ReLU), while in the output layer you need an activation function that matches the business problem. For example, if you are doing binary classification, a sigmoid function makes sense because it gives you a value between 0 and 1; you then simply set your acceptance threshold (does anything >= 0.5 count as a 1, or is it some other cutoff?).

One final note before looking at the code: in the optimization (gradient descent) section we will need to calculate the derivative (slope) of the activation functions, so we store the intermediate values Z1, A1, Z2, and A2 in a ‘cache’ during forward propagation. (The sigmoid call in the code below refers to a helper like the one sketched earlier.)

def forward_propagation(X, parameters):
    """
    Arguments:
    X -- input data of size (n_x, m)
    parameters -- python dictionary containing your parameters (output of initialization function)

    Returns:
    A2 -- The sigmoid output of the second activation
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2"
    """
    # Retrieve each parameter from the dictionary "parameters"
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]

    # Implement forward propagation to calculate A2 (probabilities)
    Z1 = np.dot(W1, X) + b1
    A1 = np.tanh(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)

    assert (A2.shape == (1, X.shape[1]))

    cache = {"Z1": Z1,
             "A1": A1,
             "Z2": Z2,
             "A2": A2}

    return A2, cache

Having calculated the weights, biases, and activation values through one forward propagation pass, we need to calculate the cost (the loss averaged across all training examples). Keep in mind that loss is essentially error: the difference between a predicted value and the true value. For binary classification we use the cross-entropy cost, J = -(1/m) Σ [ y log(a) + (1 - y) log(1 - a) ], which is exactly what the code below computes. As a quick check, a confident correct prediction (y = 1, a = 0.9) contributes a loss of -log(0.9) ≈ 0.105, while a confident wrong one (y = 1, a = 0.1) contributes -log(0.1) ≈ 2.303, so bad predictions are penalized heavily.

This cost is what we will be minimizing by iterating through forward and backward passes: generating new parameter values, calculating the cost, then updating the values in our network until the model converges or we reach the number of iterations we specified.

def compute_cost(A2, Y, parameters):
    """
    Arguments:
    A2 -- The sigmoid output of the second activation, of shape (1, number of examples)
    Y -- "true" labels vector of shape (1, number of examples)
    parameters -- python dictionary containing your parameters W1, b1, W2 and b2

    Returns:
    cost -- cross-entropy cost
    """
    m = Y.shape[1]  # number of examples

    # Compute the cross-entropy cost
    logprobs = np.multiply(np.log(A2), Y) + np.multiply(np.log(1 - A2), 1 - Y)
    cost = -np.sum(logprobs) * (1 / m)

    cost = float(np.squeeze(cost))  # makes sure cost is the dimension we expect (a scalar)

    assert (isinstance(cost, float))

    return cost

Step 4: Backward Propagation

When doing backpropagation we need to be able to calculate the derivative of the activation function: for any given value of Z, the function has some slope corresponding to that value. The goal of gradient descent is to follow these slopes downhill to a minimum of the cost function.

What we will be calculating are the following derivatives (slopes):

  • dZ[2] = A[2] - Y
  • dW[2] = (1/m) dZ[2] A[1]^T
  • db[2] = (1/m) np.sum(dZ[2], axis=1, keepdims=True)
  • dZ[1] = W[2]^T dZ[2] * g[1]’(Z[1])
  • dW[1] = (1/m) dZ[1] X^T
  • db[1] = (1/m) np.sum(dZ[1], axis=1, keepdims=True)

Note that [2] refers to the second layer of the Neural Network (the output layer), [1] refers to the first layer (in this case our hidden layer), ^T represents a transpose, and g[1]’ is the derivative of the hidden layer’s activation function (for tanh, g’(z) = 1 - a^2, which is why the code below uses 1 - np.power(A1, 2)). As you can see, we start at the end of the Neural Network and work our way back to the beginning (hence the term ‘backward propagation’).

def backward_propagation(parameters, cache, X, Y):
    """
    Arguments:
    parameters -- python dictionary containing our parameters
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2".
    X -- input data of shape (2, number of examples)
    Y -- "true" labels vector of shape (1, number of examples)

    Returns:
    grads -- python dictionary containing your gradients with respect to different parameters
    """
    m = X.shape[1]

    # First, retrieve W1 and W2 from the dictionary "parameters".
    W1 = parameters["W1"]
    W2 = parameters["W2"]

    # Retrieve also A1 and A2 from the dictionary "cache".
    A1 = cache["A1"]
    A2 = cache["A2"]

    # Backward propagation: calculate dW1, db1, dW2, db2.
    dZ2 = A2 - Y
    dW2 = 1 / m * (np.dot(dZ2, A1.T))
    db2 = 1 / m * (np.sum(dZ2, axis=1, keepdims=True))
    dZ1 = np.dot(W2.T, dZ2) * (1 - np.power(A1, 2))
    dW1 = 1 / m * (np.dot(dZ1, X.T))
    db1 = 1 / m * (np.sum(dZ1, axis=1, keepdims=True))

    grads = {"dW1": dW1,
             "db1": db1,
             "dW2": dW2,
             "db2": db2}

    return grads

Step 5: Update Parameters (Gradient Descent)

We now need to update our values W1, b1, W2, and b2 using the derivatives we just calculated. As mentioned in Part 1 of this series, this is the “learning” part of deep learning. Each parameter is updated by taking its current value and subtracting the learning rate (a hyperparameter of our choosing) times the gradient (or slope) calculated for it in the current iteration.

def update_parameters(parameters, grads, learning_rate = 1.2):
    """
    Arguments:
    parameters -- python dictionary containing your parameters
    grads -- python dictionary containing your gradients

    Returns:
    parameters -- python dictionary containing your updated parameters
    """
    # Retrieve each parameter from the dictionary "parameters"
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]

    # Retrieve each gradient from the dictionary "grads"
    dW1 = grads["dW1"]
    db1 = grads["db1"]
    dW2 = grads["dW2"]
    db2 = grads["db2"]

    # Update rule for each parameter
    W1 = W1 - learning_rate * dW1
    b1 = b1 - learning_rate * db1
    W2 = W2 - learning_rate * dW2
    b2 = b2 - learning_rate * db2

    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}

    return parameters

Before predicting, let’s put these functions together to create our Standard Neural Network model. We will set the number of iterations to 10,000 in this example.

def nn_model(X, Y, n_h, num_iterations = 10000, print_cost=False):
    """
    Arguments:
    X -- dataset of shape (2, number of examples)
    Y -- labels of shape (1, number of examples)
    n_h -- size of the hidden layer
    num_iterations -- Number of iterations in gradient descent loop
    print_cost -- if True, print the cost every 1000 iterations

    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """
    np.random.seed(3)
    n_x = layer_sizes(X, Y)[0]
    n_y = layer_sizes(X, Y)[2]

    # Initialize parameters
    parameters = initialize_parameters(n_x, n_h, n_y)
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]

    # Loop (gradient descent)
    for i in range(0, num_iterations):

        # Forward propagation. Inputs: "X, parameters". Outputs: "A2, cache".
        A2, cache = forward_propagation(X, parameters)

        # Cost function. Inputs: "A2, Y, parameters". Outputs: "cost".
        cost = compute_cost(A2, Y, parameters)

        # Backpropagation. Inputs: "parameters, cache, X, Y". Outputs: "grads".
        grads = backward_propagation(parameters, cache, X, Y)

        # Gradient descent parameter update. Inputs: "parameters, grads". Outputs: "parameters".
        parameters = update_parameters(parameters, grads)

        # Print the cost every 1000 iterations
        if print_cost and i % 1000 == 0:
            print("Cost after iteration %i: %f" % (i, cost))

    return parameters

Step 6: Prediction

To predict on hold-out data, you run a forward propagation pass using the optimized parameters. In this example we are doing binary classification: any value returned by the activation in our output layer that is greater than 0.5 is classified as a 1, and any value less than or equal to 0.5 is classified as a 0.

def predict(parameters, X):
    """
    Using the learned parameters, predicts a class for each example in X

    Arguments:
    parameters -- python dictionary containing your parameters
    X -- input data of size (n_x, m)

    Returns
    predictions -- vector of predictions of our model (red: 0 / blue: 1)
    """
    # Computes probabilities using forward propagation, and classifies to 0/1 using 0.5 as the threshold.
    A2, cache = forward_propagation(X, parameters)
    predictions = (A2 > 0.5)

    return predictions
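Before you can check accuracy, you need trained parameters. As an end-to-end sketch (the synthetic dataset below is purely illustrative and not part of the original article; any X of shape (n_x, m) with labels Y of shape (1, m) would work the same way), training might look like this:

import numpy as np

# Illustrative synthetic data: 2 input features, 400 examples, binary labels.
np.random.seed(1)
m = 400
X = np.random.randn(2, m)
Y = (X[0, :] * X[1, :] > 0).astype(int).reshape(1, m)  # label depends on which quadrant the point falls in

# Train the two-layer network with a 4-neuron hidden layer.
parameters = nn_model(X, Y, n_h=4, num_iterations=10000, print_cost=True)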

Finally, you can check the accuracy of your Neural Network as follows. The expression below counts the correctly predicted 1s (np.dot(Y, predictions.T)) and the correctly predicted 0s (np.dot(1 - Y, 1 - predictions.T)), divides by the number of examples, and converts the result to a percentage:

predictions = predict(parameters, X)

print ('Accuracy: %d' % float((np.dot(Y,predictions.T) + np.dot(1-Y,1-predictions.T))/float(Y.size)*100) + '%')

Concluding Remarks:

In the content above we covered the theory, mathematics, and code needed to construct a Standard Neural Network with two layers (one hidden layer and an output layer). I hope this has helped enhance your understanding of the AI applications you encounter on a daily basis, while providing you with a strong starting point to build your own AI application for a binary classification problem. There is a huge amount of great content available online where you can dive deeper into the topics discussed; I personally learnt all of this from Andrew Ng’s Neural Networks and Deep Learning course on Coursera.