Bread and Butter from Deep Learning by Andrew Ng — Course 1: Neural Networks and Deep Learning

Source: Deep Learning on Medium

How to implement neural networks from scratch with Python

Neural network is a stack of neurons that takes in some value and outputs some value. Given enough number of neurons, neural networks are incredibly good at mapping x to y. During supervised learning, we use this property to learn a function that maps x to ŷ. To learn that function, we need data. There are two types of data: structured and unstructured. Structured data are well organized into arrays and often comes from a database. Unstructured data are things like audio, image, text, and etc. that does not come in typical structure.

Binary classifier predicts the class of an input x by computing ŷ (0 or 1 for two classes). We use a neuron that maps x to ŷ = σ(Wᵀx + b). g = σ(z) is called a sigmoid function and computes σ(z) = 1 / (1 + e -ᶻ). Sigmoid function outputs a value between 0 and 1, which happens to work beautifully for our purpose of binary classification.

In Python:

import numpy as np
def sigmoid(z):
a = 1 / (1 + np.exp(-z))
return a
def neuron(W, x, b):
z =, x) + b
y_hat = sigmoid(z)
return y_hat

To use a 64x64x3 image as an input to our neuron, we need to flatten the image into a (64x64x3)x1 vector. And to make Wᵀx + b output a single value z, we need W to be a (64x64x3)x1 vector: (dimension of input)x(dimension of output), and b to be a single value. With N number of images, we can make a matrix X of shape (64x64x3)xN. WᵀX + b outputs Z of shape 1xN containing z’s for every single sample, and by passing Z through a sigmoid function we get final ŷ of shape 1xN that contains predictions for every single sample. We do not have to explicitly create a b of 1xN with the same value copied N times, thanks to Python broadcasting.

In Python:

train_x, train_y, test_x, test_y = load_data()
(209, 64, 64, 3)
def flatten(X):
return np.transpose(X.reshape(X.shape[0], -1))
train_x = flatten(train_x)
(12288, 50)

y is our given target or label. During supervised learning, our goal is to try to learn the exact mapping from x to y. In other words, we want to figure out a mapping function that would give us prediction ŷ as close as possible to target y. We use loss function to measure how close ŷ is to y. We could simply use the distance or use something better like logarithmic/cross-entropy loss: -(y log ŷ + (1-y)log(1-ŷ). Now, we can simplify our goal to minimizing the loss function. When we have multiple ŷ’s and y’s from multiple x’s, we take the average of loss to get cost.

In Python:

def cross_entropy_loss(y_hat, y):
if y == 0:
return -np.log(y_hat)
return -np.log(1-y_hat)

We use gradient descent to achieve our goal of finding the W and b that minimize the loss function, which is L = -(y log σ(Wᵀx + b) + (1-y)log(1-σ(Wᵀx + b))). To do this, we compute L, then derive dL/dW and dL/db to use them for updating W and b so that they would give a better L when we compute it with the same x next time. The derivatives provide the directions, while learning rate α determines the magnitude of the changes we make. W := W-α﹒dL/dW and b:= b-α﹒dL/db is the update rule we use. Using the chain rule, we can easily derive dL/dW and dL/db. In deep learning, we call the chain of computations up to L, the forward propagation, and the following chain of computations to dL/dW and dL/db, the backward propagation. Computation graphs helps us visualize the chains. Finally, logistic regression is simply repeating the process of our forward propagation, backward propagation, and update parameters, to find W and b that maps x to ŷ with minimum logarithmic loss compared to y.

In Python:

# forward propagation
y_hat = neuron(W, x, b)
loss = cross_entropy_loss(y_hat, y)
# backward propagation
dz = y_hat - y
dW =, x.T)
db = dz
# update parameters
W -= learning_rate * dW
b -= learning_rate * db

What happens to dL/dW(dW from here on) and dL/db(db from here on) when we have multiple x’s? As mentioned before with N number of x’s, we create a n𝑥 by N matrix X that has x’s as columns. Similarly, its target Y is a 1xN row vector with N number of y’s stacked horizontally. Forward propagation of X computes Ŷ with the same shape as Y and their cost by taking an average over multiple losses. Likewise, we will have to take an average over multiple dW’s and db’s. With 1xN Ŷ and Y, dZ will be 1xN as well, carrying dz’s for N samples. Taking the dot product between 1xN dZ and Nxn𝑥 X.T results in 1xn𝑥 row vector just like W. Even with just one sample, when dz is 1×1 and x.T is 1xn𝑥, the dot product between the two outputs 1xn𝑥. Then what’s the difference? With N number of samples, each element in the row vector becomes a sum of dW’s from N samples, due to dot product. To take an average, all we have to do is dividing the 1xn𝑥 row vector by N. Unfortunately and obviously, there is no Python broadcasting for dot products. To mimic the operation with dW for db, I prefer to take a dot product between 1xN dZ and Nx1 vector of ones, then divide the resulting number with N to take an average. But in the course, we take the sum of dZ over N samples and divide it by N. I prefer my way for better logic.

In Python:

# forward propagation
Y_hat = neuron(W, X, b)
cost = np.sum(cross_entropy_loss(y_hat, Y)) / N
# backward propagation
dZ = Y_hat - Y
dW =, X.T) / N
db =, np.ones((N, 1))) / N
# update parameters
W -= learning_rate * dW
b -= learning_rate * db

The beauty of the forward and backward propagation is that it works seamlessly even when we stack bunch of neurons to make a neural network. For input layer is just a placeholder for inputs and does not compute anything, we do not count it as neurons. So, in our logistic regression model, we only had a single neuron in output layer. To turn it into a neural network binary classifier, we add hidden layers in between the input and output layers. Each hidden layer have multiple number of neurons, and each neuron’s output is called an activation. Activations from a hidden layer become inputs for the next layer. The activation of the neuron in output layer is ŷ. And it’s activation function that wraps Wᵀx + b is a sigmoid function. Likewise, every neuron in hidden layers (hidden units) computes linear operation Wᵀx + b then pass it to a nonlinear activation function to output an activation. ReLU function is a better choice of activation function for hidden units because unlike in the output layer, we do not need to squeeze activations from hidden units to be in between 0 and 1. Additionally, ReLU(z) = max(0, z); therefore, it does not suffer from having very small gradients at the ends, helping the neural network to learn much faster. Activation functions have to be nonlinear because composition of linear functions only result in another linear function, which is are not complex enough to solve complex problems. Repeatedly computing activations individually takes a lot of for-loops that makes things run very slow. So, we use vectorization to summarize for-loops into one matrix dot product. This is not a new concept for us because we already used it to make matrix X out of multiple x’s by just stacking them as columns. Likewise, we will stack multiple W’s for each units in the same layer as columns and transpose it to get the vectorized W1. The shape of W1 is number of units in current layer x number of units from previous layer. With vectorized b1 and X, W1X + b1 will give number of units x N shaped Z1. And, passing it through activation function will give A1 with the same shape. A1 is passed onto the next layer (output layer), which computes W2A1 + b2 = Z2 and A2 = σ(Z2).

In Python:

def relu(z):
return np.maximum(0, z)
# forward propagation
# hidden layer 1
Z1 =, X) + b1
A1 = relu(Z1)
# output layer
Z2 =, A1) + b2
A2 = sigmoid(Z2)
y_hat = A2

For backward propagation, the derivative of a = sigmoid(z) is a(1-a), and the derivative of a = ReLU(z) is 0 if z < 0 and 1 if z ≥ 0. Let’s take a look at the computation graph of a neural network with a single hidden layer.

In Python:

# backward propagation
# output layer
dZ2 = A2 - Y
dW2 =, A1.T) / N
db2 =, np.ones(N, 1)) / N
dA1 =, dZ2)
# hidden layer 1
dZ1 = dA1 * (Z1 >= 0)
dW1 =, X.T) / N
db1 =, np.ones(N, 1)) / N
# update parameters
W2 -= learning_rate * dW2
b2 -= learning_rate * db2
W1 -= learning_rate * dW1
b1 -= learning_rate * db1

As mentioned, gradient descent is an iterative process that steps toward the point with minimum loss. To find W’s and b’s that give the minimum loss, we need to set the starting point. If we initialize the parameters (W’s and b’s) with zeros, all the hidden units will compute the same activations and during backward propagation, get the same gradients. Basically, all the hidden units will become symmetric and keep computing the same function no matter how may iterations. So, we initialize the parameters with random number. When using sigmoid as activation functions, it is important to keep the numbers small because gradients are nearly 0 at the ends.

In Python:

def initialize_parameters_single_hidden_layer():
W1 = np.random.rand(n_units[1], n_units[0]) * 0.01
b1 = np.zeros(n_units[1], 1)
W2 = np.random.rand(n_units[2], n_units[1]) * 0.01
b2 = np.zeros(n_units[2], 1)

return W1, b1, W2, b2
def initialize_parameters_logistic_regression():
W = np.random.rand(n_units[1], n_units[0]) * 0.01
b = np.zeros(n_units[1], 1)
 return W, b

By putting everything together, we can build a binary classifier. First, we load data and initialize parameters, then repeat forward-backward propagation and parameter updates. We have seen in detail what goes inside each neurons from a logistic regression model and a single hidden layer neural network model. As mentioned in the beginning, we can learn any x to y mapping function given enough number of neurons. Therefore, deep networks with multiple hidden layers can learn functions that the shallow ones cannot. The hidden layer 2 will take A1 in and output A2, and pass it onto the hidden layer 3. Repeating the process until the output layer is reached.The deeper the layer is the higher level features it learns. If we have a face recognizer, the first layer will figure out how to detect edges, the next layer will learn how to group those edges to detect eyes, ears, nose, and etc. And the later layer will learn how to group those informations to recognize different faces. Let’s build a deep neural network.

In Python:

def initialize_parameters(n_units):
parameters = {}
for l in range(1, len(n_units)):
parameters['W'+str(l)] = np.random.rand(n_units[l],
n_units[l-1]) * 0.01
parameters['b'+str(l)] = np.zeros(n_units[l], 1)

return parameters
def propagate(X, parameters, Y, L):
N = X.shape[1]
 # forward
A = {}
Z = {}
A['0'] = X
for l in range(1, L):
Z[str(l)] =['W'+str(l)], A[str(l-1)]) +
A[str(l)] = relu(Z[str(l)])
Z[str(L)] =['W'+str(L)], A[str(L-1)]) +
A[str(L)] = sigmoid(Z[str(L)])

loss = cross_entropy_loss(A[str(L)], Y)
cost = np.mean(loss)
 # backward
dA = {}
dZ = {}
gradients = {}
 dZ[str(L)] = A[str(L)] - Y
gradients['dW'+str(L)] =[str(L)], A[str(L-1)].T) / N
gradients['db'+str(L)] =[str(L)].T, np.ones(N, 1)) / N
for l in reversed(range(1, L)):
dA[str(l)] =['W'+str(l+1)].T,
gradients['dW'+str(l)] =[str(l+1)], A[str(l)].T)
/ N
gradients['db'+str(l)] =[str(l+1)].T,
np.ones(N, 1)) / N
 return gradients
def update_parameters(parameters, gradients, learning_rate, L):
for l in range(1, L+1):
parameters['W'+str(l)] -= learning_rate *
parameters['b'+str(l)] -= learning_rate *

return parameters
train_X, train_Y, test_X, test_Y, classes = load_data()
nx = train_X.shape[0]
n_units = [nx, 5, 5, 5, 3, 1]
L = len(n_units) - 1
n_epochs = 1000
learning_rate = 0.01
parameters = initialize_parameters(n_units)
for _ in range(n_epochs):
gradients = propagate(X, parameters, Y, L)
parameters = update_parameters(parameters, gradients,
learning_rate, L)

Learning rate, number of epochs (number of iterations), number of hidden units, number of hidden layers, choice of activation functions, etc. are hyper-parameters that control the character of parameters W and b. We can build some insights but the only way to figure out good hyper-parameters is by experimenting.