Note: Medium still doesn’t render mathematical equations correctly, so if you want to see the details, check out my post here.

We’re going to need some data to train on. Thankfully, `sklearn` has a number of different data sets to work with.
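For instance, the moon-shaped data used below could be generated and split something like this (the sample count, noise level, and split ratio are illustrative choices, not necessarily the ones used for the results later in the post):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Generate 1,000 moon-shaped, binary-labeled points with a bit of noise
X, Y = make_moons(n_samples=1000, noise=0.1, random_state=42)

# Hold out 25% of the data for testing
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=42)

print(X_train.shape)  # (750, 2)
```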

We’ve split our data into test and training groups; now we need to get the dimensions of our data as parameters for setting up our network. Proper dimensions are critical for the numerous linear algebra operations we have to execute.

```python
n_inputs = X_train.shape[0]      # Number of training examples
n_inputs_dim = X_train.shape[1]  # Number of input features
```
Some of the features of our network are set by our data, but others are open for design. We’re going to make a small, 2-layer neural network (one hidden layer, one output layer). We can set the number of hidden nodes in the hidden layer as we like (we’ll play with this parameter later). Additionally, because our moon-shaped data is a binary classification problem, let’s set the number of output nodes to 1. We’ll have a sigmoid activation function on the output layer, which will allow us to interpret the output probabilistically.

```python
# Layer sizes
n_h = 4    # Number of hidden nodes
n_out = 1  # Number of output nodes for the binary classifier
```
With our network dimensions set, we need to go ahead and initialize our parameters. The bias terms can be initialized to 0, but we can’t do the same with the weights. We need to initialize them randomly to ensure that we break the symmetry of the values. Think about it like this: if all the weights were initialized to 0, then there would be no signal that would propagate forward through the network. If all the weights were the same (e.g. 1), the same signal magnitude would be propagated forward. For our initialization, we’ll use the Xavier Initialization formula. This is an initialization that is designed to break symmetry and lead to faster training. The version I’m using follows the implementation found in the mxnet documentation and is given as:

c = \sqrt{\frac{3}{\frac{1}{2}(n_{in} + n_{out})}}

The weights are then selected from a uniform distribution from [-c, c].

```python
c = np.sqrt(3 / (0.5 * (n_inputs_dim + n_out)))

# Initialize weights and biases
W1 = np.random.uniform(low=-c, high=c, size=(n_inputs_dim, n_h))
b1 = np.zeros((1, n_h))
W2 = np.random.uniform(low=-c, high=c, size=(n_h, n_out))
b2 = np.zeros((1, n_out))
```
Activation Function
For our activation function, we’ll select exponential linear units, or ELUs, to train with. These are simple functions that have been found to offer better classification properties (i.e. faster and more accurate training) than other activation functions.

ELU(x) = \begin{cases} x, & \text{if } x > 0 \\ a\big(e^x - 1\big), & \text{if } x \leq 0 \end{cases}

The value a is another hyperparameter here and must be greater than 0 (a > 0). For our purposes, we’ll select a=2 as the default, but feel free to play with it to see if you can improve results.

```python
# Activation function: Exponential Linear Unit (ELU)
def Elu(x, a=2):
    """Compute the ELU output of x."""
    return np.where(x <= 0, a * (np.exp(x) - 1), x)
```
For our output layer, we’ll use the sigmoid, or logistic, function to classify our results probabilistically.

\sigma(x) = \frac{1}{1 + e^{-x}}

```python
def sigmoid(x):
    """Compute the sigmoid of array x."""
    return 1 / (1 + np.exp(-x))
```
Forward Propagation
Now that we have our network dimensions defined, have selected activation functions, and initialized our weights and biases, we can proceed with the forward propagation step. This is simply a series of linear transformations followed by the application of the activation function until we reach the output layer.

```python
Z1 = np.dot(X_train, W1) + b1
A1 = Elu(Z1)
Z2 = np.dot(A1, W2) + b2
A2 = sigmoid(Z2)
```
Our value `A2` is our network’s prediction: the probability the network assigns to each data point belonging to the positive class. To see how well it is performing, we can calculate the cross-entropy loss, which takes the log probabilities and compares them to the actual data labels (y).

The cross-entropy loss formula is given as:

L = -\frac{1}{m} \sum_{i=1}^{m} \big( y^{(i)}\log(A_2^{(i)}) + (1 - y^{(i)})\log(1 - A_2^{(i)}) \big)

```python
Y_train = Y_train.reshape(-1, 1)
log_probs = (np.multiply(np.log(A2), Y_train)
             + np.multiply(np.log(1 - A2), (1 - Y_train)))
loss = -1 / n_inputs * np.sum(log_probs)
```
Checking the loss value:

```python
loss
```

0.6395828151201088
Back Propagation
In order to implement gradient descent with back propagation, we need to roll up our sleeves and do a bit of calculus. In the previous post, I walked through the details, so feel free to go back to that if you need a refresher or want to see the derivation. Here, we’re just going to take those results and apply them to our model.

We need the derivative for all of the weights and biases in our network. This is done by working backwards from the loss function and propagating the results throughout the network. Writing the equations for our network, we have:

First Layer: Z_1 = W_1 \cdot X + b_1
First Layer Activation: A_1 = Elu(Z_1)
Output Layer: Z_2 = W_2 \cdot A_1 + b_2
Output Layer Activation: A_2 = \sigma(Z_2)
Loss: L(A_2, y) = -\big( y \log(A_2) + (1 - y)\log(1 - A_2) \big)
To calculate our back propagation values, we need \frac{\partial L}{\partial A_2}, which is:

\frac{\partial L}{\partial A_2} = -\frac{y}{A_2} + \frac{1-y}{1-A_2}
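This expression can be spot-checked numerically with a finite-difference approximation (the values of A_2 and y below are arbitrary):

```python
import numpy as np

def loss(a2, y):
    # Binary cross-entropy for a single example
    return -(y * np.log(a2) + (1 - y) * np.log(1 - a2))

a2, y, eps = 0.7, 1.0, 1e-6

# Analytic derivative: -y/A2 + (1 - y)/(1 - A2)
analytic = -y / a2 + (1 - y) / (1 - a2)

# Central finite difference
numeric = (loss(a2 + eps, y) - loss(a2 - eps, y)) / (2 * eps)

print(analytic, numeric)  # both approximately -1.4286
```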

Working backwards, the partial derivatives of each of our subsequent layers are:

Output Layer Activation Derivative (`dZ2`): \frac{\partial L}{\partial Z_2} = A_2 - y
Output Layer Derivatives (`dW2`; `db2`): \frac{\partial L}{\partial W_2} = (A_2 - y)A_1; \frac{\partial L}{\partial b_2} = A_2 - y
First Layer Activation Derivative (`dZ1`): \frac{\partial L}{\partial Z_1} = W_2(A_2 - y)\,g_1'(Z_1)
First Layer Derivatives (`dW1`; `db1`): \frac{\partial L}{\partial W_1} = W_2(A_2 - y)\,g_1'(Z_1)X; \frac{\partial L}{\partial b_1} = W_2(A_2 - y)\,g_1'(Z_1)
Where g_1(Z_1) is our first layer activation function, which we defined as the ELU. Thus, g_1'(Z_1) is:

g_1'(Z_1) = \frac{d}{dx}ELU(x) = \begin{cases} 1, & \text{if } x > 0 \\ ae^x, & \text{if } x \leq 0 \end{cases}

Notice that a number of values from our forward propagation pass are utilized here for the back propagation step, so we just need to be sure not to overwrite those values until we have applied our updates.

```python
# Derivative of the ELU activation function: 1 for x > 0, a * e^x otherwise
def dElu(x, a=2):
    return np.where(x <= 0, a * np.exp(x), 1)
```
Additionally, in the code, we’ll scale our back propagation step by dividing by the number of examples, which averages the gradients over the batch and helps smooth things out during training. We’ll call this scaling factor `m`.

```python
# Calculate back propagation
m = 1 / n_inputs
dZ2 = A2 - Y_train
dW2 = m * np.dot(A1.T, dZ2)
db2 = m * np.sum(dZ2, axis=0, keepdims=True)
# Note: dZ1 is not scaled by m; the 1/m factor is applied once in dW1 and db1
dZ1 = np.dot(dZ2, W2.T) * dElu(Z1)
dW1 = m * np.dot(X_train.T, dZ1)
db1 = m * np.sum(dZ1, axis=0, keepdims=True)
```
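As a sanity check on these expressions, we can compare the analytic gradient against a numerical finite-difference gradient on a tiny random network (the sizes and random data here are purely illustrative):

```python
import numpy as np

def Elu(x, a=2):
    return np.where(x <= 0, a * (np.exp(x) - 1), x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward_loss(X, Y, W1, b1, W2, b2):
    # Full forward pass followed by the mean cross-entropy loss
    A1 = Elu(np.dot(X, W1) + b1)
    A2 = sigmoid(np.dot(A1, W2) + b2)
    return -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / X.shape[0]

rng = np.random.RandomState(0)
X = rng.randn(5, 2)
Y = rng.randint(0, 2, (5, 1)).astype(float)
W1 = rng.randn(2, 3) * 0.5; b1 = np.zeros((1, 3))
W2 = rng.randn(3, 1) * 0.5; b2 = np.zeros((1, 1))

# Analytic gradient of W2, as derived above
A1 = Elu(np.dot(X, W1) + b1)
A2 = sigmoid(np.dot(A1, W2) + b2)
dW2 = np.dot(A1.T, A2 - Y) / X.shape[0]

# Numerical gradient of one entry of W2 via central differences
eps = 1e-6
W2p = W2.copy(); W2p[0, 0] += eps
W2m = W2.copy(); W2m[0, 0] -= eps
num = (forward_loss(X, Y, W1, b1, W2p, b2)
       - forward_loss(X, Y, W1, b1, W2m, b2)) / (2 * eps)
print(dW2[0, 0], num)  # the two values should agree closely
```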
Once we have calculated all of the derivatives, we can update our parameters using gradient descent. This is done simply by changing the parameter in question by the product of the step-size parameter or learning rate (\alpha) and the derivative we just calculated. In general then, we have:

\theta := \theta - \alpha \frac{\partial L}{\partial \theta}

Where \theta are our network weights and biases.

```python
learning_rate = 0.01
W2 -= learning_rate * dW2
b2 -= learning_rate * db2
W1 -= learning_rate * dW1
b1 -= learning_rate * db1
```
After the update, we can run another forward propagation pass, check our loss, and update our weights accordingly. It’s really just a repetition of the process we’ve just outlined, over and over, until we reach convergence with a trained network or run through the pre-defined number of training steps.
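To make that loop concrete, here is a minimal end-to-end sketch on a toy data set (the data, layer sizes, learning rate, and iteration count are all illustrative placeholders, not the article’s actual settings):

```python
import numpy as np

def Elu(x, a=2):
    return np.where(x <= 0, a * (np.exp(x) - 1), x)

def dElu(x, a=2):
    return np.where(x <= 0, a * np.exp(x), 1)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.RandomState(1)
X = rng.randn(200, 2)
Y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)  # toy labels
m, n_h = X.shape[0], 4

# Xavier-style initialization, as above
c = np.sqrt(3 / (0.5 * (X.shape[1] + 1)))
W1 = rng.uniform(-c, c, (2, n_h)); b1 = np.zeros((1, n_h))
W2 = rng.uniform(-c, c, (n_h, 1)); b2 = np.zeros((1, 1))

losses = []
for i in range(1000):
    # Forward pass
    Z1 = X @ W1 + b1; A1 = Elu(Z1)
    Z2 = A1 @ W2 + b2; A2 = sigmoid(Z2)
    losses.append(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))
    # Backward pass
    dZ2 = A2 - Y
    dW2 = A1.T @ dZ2 / m; db2 = dZ2.sum(axis=0, keepdims=True) / m
    dZ1 = (dZ2 @ W2.T) * dElu(Z1)
    dW1 = X.T @ dZ1 / m; db1 = dZ1.sum(axis=0, keepdims=True) / m
    # Gradient descent update
    for p, dp in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        p -= 0.1 * dp

print(losses[0], losses[-1])  # the loss should drop as training proceeds
```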

Putting it all Together
Because it could take hundreds, thousands, or in some cases millions of iterations to train the network, we’ll take the code above and put it into a class with a number of training and testing methods to tie this together in a compact way and see how it performs on our data. You can get the code on GitHub here or copy it from my post here.

With our `network` class defined, we only need to pass it our training data, initialize the weights, and train it.

```python
net = network(X_train, Y_train)
net.initialize(n_hidden=3)
training_loss = net.train(learning_rate=0.1, log_loss=True)

plt.figure(figsize=(12, 8))
plt.plot(training_loss)
plt.title("Network Training Loss")
plt.xlabel("Number of Training Iterations")
plt.ylabel("Cross-Entropy Loss")
plt.show()
```
```python
print("Training Accuracy: %.2f" % net.train_accuracy())
```

Training Accuracy: 0.89
It seems to be working well! Let’s see how well it predicts our test data.

```python
pred = net.predict(X_test)
np.sum(Y_test.reshape(-1, 1) == pred) / len(Y_test)
```

0.84799999999999998
85% classification accuracy isn’t too bad, and it was (hopefully) fairly straightforward to follow along with my implementation. Another interesting thing to visualize is the decision boundary for our data. To do this, we can pass our network a grid of points and see where it places the cut-off for the classification. If you noticed, I already built this method into the `network` class, and we can simply call it once we have trained our network to visualize the results.

```python
net.plot_decision_boundary()
```
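For reference, a decision-boundary plot of this kind is typically built by classifying a dense `np.meshgrid` of points and drawing the resulting regions with `contourf`. Here is a generic sketch, with a `predict` callable standing in for the trained network (this is not the actual method from the `network` class):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_decision_boundary(predict, X, Y, h=0.02):
    # Build a grid covering the data, classify every grid point,
    # and draw the resulting class regions under the data points
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.4)
    plt.scatter(X[:, 0], X[:, 1], c=Y.ravel(), edgecolors='k')
    plt.show()
```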
We can also see how some of the hyperparameters, such as the number of hidden nodes, affect the results.

```python
n_hidden = [2, 5, 10, 25, 50]

# Loop through hidden layer settings
for n in n_hidden:
    net = network(X, Y)
    net.initialize(n_hidden=n)
    net.train(learning_rate=0.1, log_loss=False)
    print("%d Hidden Nodes" % n)
    net.plot_decision_boundary()
```

2 Hidden Nodes
5 Hidden Nodes
10 Hidden Nodes
25 Hidden Nodes
50 Hidden Nodes
As you can see, the network gradually gets better and better at classifying our shapes. Typically, as we increase the number of hidden nodes or network layers, we run the risk of overfitting our data. There are methods to deal with this, such as dropout, but that will be for another post when we’re dealing with a much larger network.
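For a rough idea of what dropout involves, the core of the (inverted) dropout technique is just a random mask applied to a layer’s activations during training; the `keep_prob` value below is an illustrative hyperparameter:

```python
import numpy as np

def dropout(A, keep_prob=0.8, rng=np.random):
    # Randomly zero out activations, scaling the survivors so the
    # expected value of the layer's output is unchanged
    mask = (rng.uniform(size=A.shape) < keep_prob)
    return A * mask / keep_prob

A1 = np.ones((4, 5))
print(dropout(A1, keep_prob=0.8))  # zeros mixed with values of 1.25
```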

There are a few things you can try on your own, such as adjusting the network’s activation function to see if you can get it to learn faster, or whether another function leads to overfitting. Remember to check the derivative and apply the new function to both the forward and backward propagation passes. You could also adjust the loss function. Something like a softmax function for the output layer generalizes to multi-class problems and takes a different number of output nodes; it also introduces additional complexity, because the softmax cross-entropy loss has a different form and gradient. Finally, we only have a two-layer neural network (not counting the input layer), which doesn’t really qualify as deep learning, although the precise definition is a bit nebulous. Try adding layers. It’s really the same sort of procedure, just more terms for the forward and backward propagation passes to take care of.
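If you do go the multi-class route, a numerically stable softmax for the output layer might look like this (a sketch, not part of the original network):

```python
import numpy as np

def softmax(Z):
    # Subtract the row-wise max before exponentiating for numerical stability
    Z_shift = Z - Z.max(axis=1, keepdims=True)
    exp_Z = np.exp(Z_shift)
    # Each row becomes a probability distribution over the classes
    return exp_Z / exp_Z.sum(axis=1, keepdims=True)

Z = np.array([[1.0, 2.0, 3.0]])
print(softmax(Z))  # rows sum to 1
```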

That’s it! You’ve got a neural network built from nothing but `numpy`! Hopefully you’re comfortable with the concepts and ready to go deeper into the topic and utilize frameworks such as TensorFlow to build these models. As always, leave some comments or let me know if there are any questions!