How to implement neural networks from scratch with Python

A **neural network** is a stack of **neurons**, each of which takes in some values and outputs a value. Given enough neurons, neural networks are incredibly good at **mapping** **x** to **y**. During **supervised learning**, we use this property to learn a **function** that maps **x** to **ŷ**. To learn that function, we need data. There are two types of data: **structured** and **unstructured**. Structured data is well organized into arrays and often comes from a database. Unstructured data is things like audio, images, and text that do not come in a typical structure.

A **binary classifier** predicts the **class** of an input **x** by computing **ŷ** (**0** or **1** for two classes). We use a **neuron** that maps **x** to **ŷ = σ(Wᵀx + b)**. **σ(z)** is called the **sigmoid function** and computes **σ(z) = 1 / (1 + e⁻ᶻ)**. The sigmoid function outputs a value between 0 and 1, which happens to work beautifully for our purpose of binary classification.

In Python:

import numpy as np

def sigmoid(z):
    # squash z into (0, 1)
    a = 1 / (1 + np.exp(-z))
    return a

def neuron(W, x, b):
    # linear step, then the sigmoid activation
    z = np.dot(np.transpose(W), x) + b
    y_hat = sigmoid(z)
    return y_hat

To use a 64x64x3 image as an input to our neuron, we need to flatten the image into a (64x64x3)x1 vector. And to make **Wᵀx + b** output a single value **z**, we need **W** to be a (64x64x3)x1 vector, that is, (dimension of input)x(dimension of output), and **b** to be a single value. With **N** images, we can make a matrix **X** of shape (64x64x3)x**N**. **WᵀX + b** outputs **Z** of shape 1x**N** containing a **z** for every single sample, and by passing **Z** through the sigmoid function we get the final **ŷ** of shape 1x**N** that contains predictions for every single sample. We do not have to explicitly create a **b** of shape 1x**N** with the same value copied **N** times, thanks to Python **broadcasting**.

In Python:

train_x, train_y, test_x, test_y = load_data()

print(train_x.shape)
(209, 64, 64, 3)

def flatten(X):
    # (N, 64, 64, 3) -> (64*64*3, N): one flattened image per column
    return np.transpose(X.reshape(X.shape[0], -1))

train_x = flatten(train_x)

print(train_x.shape)
(12288, 209)
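
To see the **broadcasting** at work, here is a minimal sketch with toy shapes (the sizes and values are made up for illustration; numpy is imported as before):

W = np.zeros((3, 1))                    # a hypothetical 3-dimensional input
X = np.ones((3, 4))                     # N = 4 samples stacked as columns
b = 0.5                                 # a single scalar
print(np.dot(np.transpose(W), X) + b)   # [[0.5 0.5 0.5 0.5]]: b broadcast to 1x4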

**y** is our given **target** or **label**. During **supervised learning**, our goal is to learn the exact mapping from **x** to **y**. In other words, we want to figure out a mapping function that gives us a **prediction** **ŷ** as close as possible to the **target** **y**. We use a **loss function** to measure how close **ŷ** is to **y**. We could simply use the distance, or use something better like the **logarithmic/cross-entropy loss**: **-(y log ŷ + (1-y) log(1-ŷ))**. Now we can simplify our goal to minimizing the loss function. When we have multiple **ŷ**'s and **y**'s from multiple **x**'s, we take the average of the **loss** to get the **cost**.

In Python:

def cross_entropy_loss(y_hat, y):
    # -(y log ŷ + (1-y) log(1-ŷ)); only one term survives for y = 0 or 1.
    # Elementwise, so it also works when y_hat and y are 1xN arrays.
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
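
A quick sanity check with made-up numbers: a confident, correct prediction should give a small loss, and a confident, wrong one a large loss.

print(cross_entropy_loss(0.9, 1))   # ~0.105: ŷ is close to y = 1
print(cross_entropy_loss(0.9, 0))   # ~2.303: ŷ is far from y = 0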

We use **gradient descent** to achieve our goal of finding the **W** and **b** that minimize the loss function, which is **L = -(y log σ(Wᵀx + b) + (1-y) log(1-σ(Wᵀx + b)))**. To do this, we compute **L**, then derive **dL/dW** and **dL/db** and use them to update **W** and **b** so that they give a better **L** the next time we compute it with the same **x**. The derivatives provide the directions, while the **learning rate** **α** determines the magnitude of the changes we make. **W := W - α·dL/dW** and **b := b - α·dL/db** is the **update rule** we use. Using the **chain rule**, we can easily derive **dL/dW** and **dL/db**. In deep learning, we call the chain of computations up to **L** the **forward propagation**, and the following chain of computations to **dL/dW** and **dL/db** the **backward propagation**. **Computation graphs** help us visualize the chains. Finally, **logistic regression** is simply repeating this process of forward propagation, backward propagation, and parameter updates, to find the **W** and **b** that map **x** to **ŷ** with minimum **logarithmic loss** compared to **y**.
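
To see where the **dz = ŷ - y** in the code below comes from, write **a = σ(z)** and apply the chain rule (a standard derivation, easy to verify by hand):

**dL/da = -(y/a - (1-y)/(1-a))** and **da/dz = a(1-a)**, so

**dL/dz = dL/da · da/dz = a - y = ŷ - y**

From there, **dL/dW = dL/dz · dz/dW = (ŷ - y)x** and **dL/db = dL/dz · dz/db = ŷ - y**.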

In Python:

# forward propagation
y_hat = neuron(W, x, b)
loss = cross_entropy_loss(y_hat, y)

# backward propagation
dz = y_hat - y            # dL/dz, as derived above
dW = np.dot(x, dz)        # dL/dW = (ŷ - y)x, same (64x64x3)x1 shape as W
db = dz                   # dL/db = ŷ - y

# update parameters
W -= learning_rate * dW
b -= learning_rate * db

What happens to **dL/dW** (**dW** from here on) and **dL/db** (**db** from here on) when we have multiple **x**'s? As mentioned before, with **N** **x**'s we create an **n𝑥**x**N** matrix **X** that has the **x**'s as columns. Similarly, its target **Y** is a 1x**N** row vector with the **N** **y**'s stacked horizontally. Forward propagation of **X** computes **Ŷ** with the same shape as **Y**, and their **cost** is the average over the **N** losses. Likewise, we have to take the average over **N** **dW**'s and **db**'s. With 1x**N** **Ŷ** and **Y**, **dZ** is 1x**N** as well, carrying the **dz**'s for all **N** samples. Taking the dot product between the **n𝑥**x**N** **X** and the **N**x1 **dZ.T** results in an **n𝑥**x1 column vector, just like **W**. Even with just one sample, when **x** is **n𝑥**x1 and **dz** is 1x1, the dot product between the two outputs **n𝑥**x1. Then what's the difference? With **N** samples, each element of the result becomes a sum of **dW**'s from the **N** samples, due to the dot product. To take the average, all we have to do is divide by **N**; unlike with **b**, there is no broadcasting that does the averaging for us. For **db**, I prefer to mimic the operation used for **dW**: take the dot product between the 1x**N** **dZ** and an **N**x1 vector of ones, then divide the resulting number by **N**. In the course, we instead take the sum of **dZ** over the **N** samples and divide it by **N**; both give the same answer, but I find the dot-product form more consistent.

In Python:

# forward propagation
Y_hat = neuron(W, X, b)
cost = np.sum(cross_entropy_loss(Y_hat, Y)) / N

# backward propagation
dZ = Y_hat - Y                          # 1xN: a dz for every sample
dW = np.dot(X, dZ.T) / N                # average gradient, same shape as W
db = np.dot(dZ, np.ones((N, 1))) / N    # average of the N db's

# update parameters
W -= learning_rate * dW
b -= learning_rate * db
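
Putting the pieces together, a minimal sketch of the whole **logistic regression** loop (the n_epochs and learning_rate values here are arbitrary, and train_y is assumed to be a 1xN row vector like Y above):

train_x, train_y, test_x, test_y = load_data()
train_x = flatten(train_x)

nx, N = train_x.shape
W = np.zeros((nx, 1))     # with no hidden units, zero initialization is fine
b = 0.0
learning_rate = 0.01
n_epochs = 1000

for _ in range(n_epochs):
    Y_hat = neuron(W, train_x, b)
    dZ = Y_hat - train_y
    W -= learning_rate * np.dot(train_x, dZ.T) / N
    b -= learning_rate * np.sum(dZ) / N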

The beauty of **forward** and **backward propagation** is that they work seamlessly even when we stack a bunch of **neurons** to make a **neural network**. Because the **input layer** is just a placeholder for inputs and does not compute anything, we do not count it as **neurons**. So in our **logistic regression model**, we only had a single **neuron** in the **output layer**. To turn it into a **neural network binary classifier**, we add **hidden layers** in between the **input** and **output layers**. Each **hidden layer** has multiple **neurons**, and each **neuron**'s output is called an **activation**. **Activations** from a **hidden layer** become **inputs** for the next layer. The **activation** of the **neuron** in the **output layer** is **ŷ**, and the **activation function** that wraps its **Wᵀx + b** is the **sigmoid function**. Likewise, every **neuron** in the **hidden layers** (**hidden units**) computes the **linear** operation **Wᵀx + b** and then passes it to a **nonlinear** **activation function** to output an **activation**. The **ReLU** function is a better choice of **activation function** for **hidden units** because, unlike in the **output layer**, we do not need to squeeze the **activations** from **hidden units** to be between **0** and **1**. Additionally, **ReLU(z) = max(0, z)**; therefore it does not suffer from having very small gradients at the ends, helping the **neural network** learn much faster. **Activation functions** have to be **nonlinear** because a composition of **linear** functions is just another **linear** function, which is not complex enough to solve complex problems.

Repeatedly computing activations individually takes a lot of for-loops, which makes things run very slowly. So we use **vectorization** to summarize the for-loops into one matrix dot product. This is not a new concept for us, because we already used it to make the matrix **X** out of multiple **x**'s by stacking them as columns. Likewise, we stack the **W**'s of all the units in the same layer as columns and transpose the result to get the vectorized **W1**. The shape of **W1** is **number of units in current layer** x **number of units in previous layer**. With vectorized **b1** and **X**, **W1X + b1** gives **Z1** of shape **number of units** x **N**. Passing it through the activation function gives **A1** with the same shape. **A1** is passed on to the next layer (here, the output layer), which computes **Z2 = W2A1 + b2** and **A2 = σ(Z2)**.

In Python:

def relu(z):
    return np.maximum(0, z)

# forward propagation
# hidden layer 1
Z1 = np.dot(W1, X) + b1
A1 = relu(Z1)

# output layer
Z2 = np.dot(W2, A1) + b2
A2 = sigmoid(Z2)
y_hat = A2
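
A quick shape check with toy sizes (3 input features, 4 hidden units, N = 5 samples; all numbers are made up) confirms the bookkeeping from above:

nx, n1, N = 3, 4, 5
X = np.random.randn(nx, N)
W1, b1 = np.random.randn(n1, nx), np.zeros((n1, 1))
W2, b2 = np.random.randn(1, n1), np.zeros((1, 1))
A1 = relu(np.dot(W1, X) + b1)
A2 = sigmoid(np.dot(W2, A1) + b2)
print(A1.shape, A2.shape)    # (4, 5) (1, 5): units x N, then 1 x N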

For **backward propagation**, the derivative of **a = sigmoid(z)** is **a(1-a)**, and the derivative of **a = ReLU(z)** is **0 if z < 0** and **1 if z ≥ 0**. With those two facts, here is the backward pass of a **neural network** with a **single hidden layer**.

In Python:

# backward propagation
# output layer
dZ2 = A2 - Y
dW2 = np.dot(dZ2, A1.T) / N
db2 = np.dot(dZ2, np.ones((N, 1))) / N
dA1 = np.dot(W2.T, dZ2)

# hidden layer 1
dZ1 = dA1 * (Z1 >= 0)    # ReLU derivative: 1 where z >= 0, else 0
dW1 = np.dot(dZ1, X.T) / N
db1 = np.dot(dZ1, np.ones((N, 1))) / N

# update parameters
W2 -= learning_rate * dW2
b2 -= learning_rate * db2
W1 -= learning_rate * dW1
b1 -= learning_rate * db1
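
If the derivative formulas feel like magic, a cheap finite-difference check can confirm them (the test point z and eps are arbitrary choices):

eps = 1e-6
z = 0.7
a = sigmoid(z)
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(numeric, a * (1 - a))    # both ~0.2217: the two derivatives agree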

As mentioned, **gradient descent** is an iterative process that steps toward the point with minimum loss. To find the **W**'s and **b**'s that give the minimum loss, we need to set a starting point. If we initialize the parameters (the **W**'s and **b**'s) with zeros, all the **hidden units** will compute the same **activations** and, during **backward propagation**, get the same **gradients**. Basically, all the **hidden units** become symmetric and keep computing the same function no matter how many iterations we run. So we initialize the parameters with small random numbers. When using the **sigmoid** as an **activation function**, it is important to keep the numbers small, because its gradients are nearly 0 at the ends.

In Python:

def initialize_parameters_single_hidden_layer(n_units):
    # n_units = [n_input, n_hidden, n_output]
    W1 = np.random.rand(n_units[1], n_units[0]) * 0.01
    b1 = np.zeros((n_units[1], 1))
    W2 = np.random.rand(n_units[2], n_units[1]) * 0.01
    b2 = np.zeros((n_units[2], 1))
    return W1, b1, W2, b2

def initialize_parameters_logistic_regression(n_units):
    W = np.random.rand(n_units[1], n_units[0]) * 0.01
    b = np.zeros((n_units[1], 1))
    return W, b
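
A tiny experiment (toy shapes, made-up data) shows the symmetry problem: with any constant initialization, zeros included, every hidden unit computes the same activations.

W1 = np.full((4, 3), 0.5)    # 4 hidden units that all start identical
b1 = np.zeros((4, 1))
X = np.random.randn(3, 5)
A1 = relu(np.dot(W1, X) + b1)
print(A1)                    # 4 identical rows: the units are symmetric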

By putting everything together, we can build a **binary classifier**. First we **load data** and **initialize parameters**, then we repeat **forward-backward propagation** and **parameter updates**. We have seen in detail what goes on inside each neuron of a **logistic regression** model and a **single hidden layer neural network model**. As mentioned in the beginning, we can learn any x to y mapping function given enough neurons. **Deep networks** with multiple **hidden layers** can learn functions that shallow ones struggle to represent. **Hidden layer 2** takes **A1** in, outputs **A2**, and passes it on to **hidden layer 3**, repeating the process until the output layer is reached. The deeper the layer is, the higher-level the features it learns. If we have a face recognizer, the first layer will figure out how to detect edges, the next layer will learn how to group those edges to detect eyes, ears, noses, etc., and the later layers will learn how to group that information to recognize different faces. Let's build a **deep neural network**.

In Python:

def initialize_parameters(n_units):
    parameters = {}
    for l in range(1, len(n_units)):
        parameters['W'+str(l)] = np.random.rand(n_units[l], n_units[l-1]) * 0.01
        parameters['b'+str(l)] = np.zeros((n_units[l], 1))
    return parameters

def propagate(X, parameters, Y, L):
    N = X.shape[1]

    # forward
    A = {}
    Z = {}
    A['0'] = X
    for l in range(1, L):
        Z[str(l)] = np.dot(parameters['W'+str(l)], A[str(l-1)]) + parameters['b'+str(l)]
        A[str(l)] = relu(Z[str(l)])
    # output layer uses the sigmoid
    Z[str(L)] = np.dot(parameters['W'+str(L)], A[str(L-1)]) + parameters['b'+str(L)]
    A[str(L)] = sigmoid(Z[str(L)])
    loss = cross_entropy_loss(A[str(L)], Y)
    cost = np.mean(loss)

    # backward
    dA = {}
    dZ = {}
    gradients = {}
    dZ[str(L)] = A[str(L)] - Y
    gradients['dW'+str(L)] = np.dot(dZ[str(L)], A[str(L-1)].T) / N
    gradients['db'+str(L)] = np.dot(dZ[str(L)], np.ones((N, 1))) / N
    for l in reversed(range(1, L)):
        dA[str(l)] = np.dot(parameters['W'+str(l+1)].T, dZ[str(l+1)])
        dZ[str(l)] = dA[str(l)] * (Z[str(l)] >= 0)    # ReLU derivative
        gradients['dW'+str(l)] = np.dot(dZ[str(l)], A[str(l-1)].T) / N
        gradients['db'+str(l)] = np.dot(dZ[str(l)], np.ones((N, 1))) / N
    return gradients, cost

def update_parameters(parameters, gradients, learning_rate, L):
    for l in range(1, L+1):
        parameters['W'+str(l)] -= learning_rate * gradients['dW'+str(l)]
        parameters['b'+str(l)] -= learning_rate * gradients['db'+str(l)]
    return parameters

train_X, train_Y, test_X, test_Y, classes = load_data()
train_X = flatten(train_X)    # images into columns, as before

nx = train_X.shape[0]
n_units = [nx, 5, 5, 5, 3, 1]
L = len(n_units) - 1
n_epochs = 1000
learning_rate = 0.01

parameters = initialize_parameters(n_units)
for _ in range(n_epochs):
    gradients, cost = propagate(train_X, parameters, train_Y, L)
    parameters = update_parameters(parameters, gradients, learning_rate, L)
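
Finally, to actually classify with the trained network, we threshold **ŷ** at 0.5. A minimal sketch (predict is my own helper, not from the course, and test_Y is assumed to be a 1xN row vector like train_Y):

def predict(X, parameters, L):
    # forward pass only: relu hidden layers, then the sigmoid output layer
    A = X
    for l in range(1, L):
        A = relu(np.dot(parameters['W'+str(l)], A) + parameters['b'+str(l)])
    y_hat = sigmoid(np.dot(parameters['W'+str(L)], A) + parameters['b'+str(L)])
    return (y_hat > 0.5).astype(int)

predictions = predict(flatten(test_X), parameters, L)
print(np.mean(predictions == test_Y))    # fraction of correct predictions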

The learning rate, number of epochs (iterations), number of hidden units, number of hidden layers, choice of activation functions, etc. are **hyper-parameters** that control the character of the **parameters** **W** and **b**. We can build up some intuitions, but the only way to figure out good **hyper-parameters** is by experimenting.