Understanding Convolution Neural Networks - Part II


This article is a continuation of Part I. If you haven’t read Part I, I would strongly advise you to read it first.

In Part II, we will build the network described in Figure 1.1.

Let’s define some terms which we would be using in this article.

f = filter size

n_filters = number of filters

p = padding

s = stride

m = # of training examples

Figure 1.1 Convolution Neural Network

Data

We will be using the digit signs data set, which contains images of shape 64 x 64 x 3. The training set consists of 1080 images and the test set consists of 120 images. There are 6 classes, corresponding to the digit signs 0, 1, 2, 3, 4 and 5.

Figure 1.2 Image showing the digit sign 4.
Figure 1.3 Image showing the digit sign 4.
Figure 1.4 Image showing the digit sign 5.
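As a rough sketch of the preprocessing such a pipeline typically needs, the snippet below shows how the integer labels can be converted to one-hot vectors. The helper convert_to_one_hot mirrors the one used later in the loss function, but its exact implementation (and shape convention) in the accompanying repository may differ.

import numpy as np

def convert_to_one_hot(labels, n_classes):
    # labels has shape (1, m); returns an (m, n_classes) one-hot matrix
    return np.eye(n_classes)[labels.reshape(-1)]

labels = np.array([[0, 5, 2, 4]])           # dummy labels of shape (1, 4)
print(convert_to_one_hot(labels, 6).shape)  # (4, 6)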

We will make use of the concepts that we went through in Part I to build a convolutional neural network from scratch.

Convolutional Neural Network

The network described in Figure 1.1 consists of a convolution layer, a ReLU layer, a max pool layer, a flatten layer and a dense layer, followed by a softmax activation function since we have more than one class. We will use the Adam optimizer for training our network. In all layers the weights are initialised using He initialisation [He et al. 2015], as the network is ReLU activated.
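As a small illustration of the He initialisation mentioned above, here is a minimal sketch for the convolution filters, assuming the filter tensor layout (f, f, n_C_prev, n_C) used in the code later in this article; the exact initialisation code in the accompanying repository may differ.

import numpy as np

f, n_C_prev, n_C = 3, 3, 10           # 3 x 3 filters, 3 input channels, 10 filters
fan_in = f * f * n_C_prev             # number of inputs feeding each output unit

# He initialisation [He et al. 2015]: zero-mean Gaussian with variance 2 / fan_in
W = np.random.randn(f, f, n_C_prev, n_C) * np.sqrt(2. / fan_in)
b = np.zeros((1, 1, 1, n_C))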

Forward Propagation

Convolution Layer

The hyperparameters of the convolution layer are:

  • p = 2
  • s = 2
  • f = 3 (3 x 3 filters)
  • n_filters = 10
Figure 1.5 Single step of convolution

Forward propagation in the convolution layer consists of three steps:

  • Pad the image with zeros by amount p
def zero_pad(self, X, pad):
    """
    Pad the images of the dataset X with zeros.

    Zeros are added around the border (height and width) of each image.

    Parameters:
    X -- Images -- numpy array of shape (m, n_H, n_W, n_C)
    pad -- padding amount -- int

    Returns:
    X_pad -- Images padded with zeros around width and height -- numpy array of shape (m, n_H + 2*pad, n_W + 2*pad, n_C)
    """
    X_pad = np.pad(X, ((0, 0), (pad, pad), (pad, pad), (0, 0)), 'constant')
    return X_pad
  • Get image window based on s
def get_corners(self, height, width, filter_size, stride):
    """
    Get the corners of the current window relative to the stride.

    Parameters:
    height -- vertical position (row index) of the output element -- int
    width -- horizontal position (column index) of the output element -- int
    filter_size -- size of filter -- int
    stride -- amount by which the filter shifts -- int

    Returns:
    vert_start -- a scalar value, top edge of the window.
    vert_end -- a scalar value, bottom edge of the window.
    horiz_start -- a scalar value, left edge of the window.
    horiz_end -- a scalar value, right edge of the window.
    """
    vert_start = height * stride
    vert_end = vert_start + filter_size
    horiz_start = width * stride
    horiz_end = horiz_start + filter_size
    return vert_start, vert_end, horiz_start, horiz_end
  • Apply the convolution operation: an element-wise product of the image window with the filter, followed by a sum
def convolve(self, image_slice, W, b):
    """
    Apply a filter defined by W on a single slice of an image.

    Parameters:
    image_slice -- slice of input data -- numpy array of shape (f, f, n_C_prev)
    W -- Weight parameters contained in a window -- numpy array of shape (f, f, n_C_prev)
    b -- Bias parameters contained in a window -- numpy array of shape (1, 1, 1)

    Returns:
    Z -- a scalar value, result of convolving the sliding window (W, b) on image_slice
    """
    s = np.multiply(image_slice, W)
    z = np.sum(s)
    Z = z + float(b)
    return Z

Bringing it all together, the forward propagation of the convolution layer over the complete training data is the following.

def forward(self, A_prev):
    """
    Forward propagation for convolution.

    This takes the activations from the previous layer and convolves them
    with the filters defined by W and bias b.

    Parameters:
    A_prev -- output activations of the previous layer, numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)

    Returns:
    Z -- convolution output, numpy array of shape (m, n_H, n_W, n_C)
    """
    np.random.seed(self.seed)
    self.A_prev = A_prev
    filter_size, filter_size, n_C_prev, n_C = self.params[0].shape
    m, n_H_prev, n_W_prev, n_C_prev = self.A_prev.shape
    Z = np.zeros((m, self.n_H, self.n_W, self.n_C))
    A_prev_pad = self.zero_pad(self.A_prev, self.pad)

    for i in range(m):                      # loop over the training examples
        a_prev_pad = A_prev_pad[i, :, :, :]
        for h in range(self.n_H):           # loop over the output height
            for w in range(self.n_W):       # loop over the output width
                for c in range(n_C):        # loop over the filters
                    vert_start, vert_end, horiz_start, horiz_end = self.get_corners(h, w, self.filter_size, self.stride)
                    a_slice_prev = a_prev_pad[vert_start:vert_end, horiz_start:horiz_end, :]
                    Z[i, h, w, c] = self.convolve(
                        a_slice_prev, self.params[0][:, :, :, c], self.params[1][:, :, :, c])
    assert (Z.shape == (m, self.n_H, self.n_W, self.n_C))
    return Z

The output of the convolution layer is of shape (m, 33, 33, 10). The general formula to calculate the height and width of the output of a convolution layer is n_out = floor((n + 2p - f) / s) + 1:

Figure 1.6 Output height and width of convolution layer.

where n is the input dimension. This is useful for checking matrix shapes during implementation. The output shape is then (m, height, width, number of channels), where the number of channels equals n_filters.
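As a quick sanity check of this formula with the hyperparameters stated above (a minimal sketch, using only values given in this article):

import numpy as np

n, p, f, s = 64, 2, 3, 2                        # input size and conv hyperparameters
n_out = int(np.floor((n + 2 * p - f) / s)) + 1
print(n_out)                                    # 33 -> output shape (m, 33, 33, 10)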

Relu layer

Forward propagation in ReLU

def forward(self, Z):
    """
    Forward propagation of relu layer.

    Parameters:
    Z -- Input data -- numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)

    Returns:
    A -- Activations of relu layer -- numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)
    """
    self.Z = Z
    A = np.maximum(0, Z)  # element-wise
    return A

The input to the ReLU layer is the output of the convolution layer. ReLU does not change the matrix dimensions, so the output shape remains (m, 33, 33, 10).

Maxpool Layer

The hyperparameters of the max pool layer are its filter size f and stride s; given the input shape (m, 33, 33, 10) and the output shape (m, 32, 32, 10) reported below, these correspond to f = 2 and s = 1.

Forward propagation in the maxpool layer consists of two steps:

  • Get input window based on s
def get_corners(self, height, width, filter_size, stride):
    """
    Get the corners of the current window relative to the stride.

    Parameters:
    height -- vertical position (row index) of the output element -- int
    width -- horizontal position (column index) of the output element -- int
    filter_size -- size of filter -- int
    stride -- amount by which the filter shifts -- int

    Returns:
    vert_start -- a scalar value, top edge of the window.
    vert_end -- a scalar value, bottom edge of the window.
    horiz_start -- a scalar value, left edge of the window.
    horiz_end -- a scalar value, right edge of the window.
    """
    vert_start = height * stride
    vert_end = vert_start + filter_size
    horiz_start = width * stride
    horiz_end = horiz_start + filter_size
    return vert_start, vert_end, horiz_start, horiz_end
  • Apply the max pool operation on the input window; forward propagation over the complete training data is then the following
def forward(self, A_prev):
    """
    Forward propagation of the pooling layer.

    Parameters:
    A_prev -- Input data, numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)

    Returns:
    Z -- output of the pool layer, a numpy array of shape (m, n_H, n_W, n_C)
    """
    self.A_prev = A_prev
    m, n_H_prev, n_W_prev, n_C_prev = self.A_prev.shape
    Z = np.empty((m, self.n_H, self.n_W, n_C_prev))
    for i in range(m):                      # loop over the training examples
        a_prev = self.A_prev[i]
        for h in range(self.n_H):           # loop over the output height
            for w in range(self.n_W):       # loop over the output width
                for c in range(self.n_C):   # loop over the channels
                    vert_start, vert_end, horiz_start, horiz_end = self.get_corners(
                        h, w, self.filter_size, self.stride)
                    a_slice_prev = a_prev[vert_start:vert_end, horiz_start:horiz_end, c]
                    Z[i, h, w, c] = np.max(a_slice_prev)
    assert (Z.shape == (m, self.n_H, self.n_W, n_C_prev))
    return Z

The output of the maxpool layer is of shape (m, 32, 32, 10). The general formula to calculate the height and width of the output of the maxpool layer is the same as in Figure 1.6 (with p = 0). The output shape is then (m, height, width, number of channels), where the number of channels is the last axis of the input.
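The same formula reproduces the pooling output size; the values f = 2 and s = 1 below are the ones inferred above from the reported shapes, not values stated explicitly in the original text.

n, f, s = 33, 2, 1              # input size from the conv layer, inferred pool hyperparameters
n_out = (n - f) // s + 1
print(n_out)                    # 32 -> output shape (m, 32, 32, 10)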

Flatten layer

Forward propagation in flatten layer

def forward(self, A_prev):
    """
    Forward propagation of flatten layer.

    Parameters:
    A_prev -- input data -- numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)

    Returns:
    Z -- flattened numpy array of shape (m, n_H_prev * n_W_prev * n_C_prev)
    """
    np.random.seed(self.seed)
    self.A_prev = A_prev
    output = np.prod(self.A_prev.shape[1:])
    m = self.A_prev.shape[0]
    self.out_shape = (self.A_prev.shape[0], -1)
    Z = self.A_prev.ravel().reshape(self.out_shape)
    assert (Z.shape == (m, output))
    return Z

The output shape is (m, 10240), since 32 * 32 * 10 = 10240.

Dense Layer

This is a fully connected neural network layer.

def forward(self, A_prev):
    """
    Forward propagation of Dense layer.

    Parameters:
    A_prev -- input data -- numpy array of shape (m, input_dim)

    Returns:
    Z -- output of the Dense layer -- numpy array of shape (m, output_dim)
    """
    np.random.seed(self.seed)
    m = A_prev.shape[0]
    self.A_prev = A_prev
    Z = np.dot(self.A_prev, self.params[0]) + self.params[1]
    assert (Z.shape == (m, self.output_dim))
    return Z

This layer is typically placed right before the loss function. Since the number of classes is 6, the output shape of this layer is (m, 6).

Softmax Loss

Since this is a multi-class classification problem, we use the softmax loss function, which is also called categorical cross-entropy loss. The softmax activation function generates a probability for each individual class, with all probabilities summing to one, as shown in Figure 1.6.

Figure 1.6 Softmax probability
def softmax(z):
    """
    Compute softmax probabilities.

    :param z: output of the previous layer, of shape (m, 6)
    :return: probabilities of shape (m, 6)
    """
    # numerical stability: subtract the row-wise maximum
    z = z - np.expand_dims(np.max(z, axis=1), 1)
    z = np.exp(z)
    ax_sum = np.expand_dims(np.sum(z, axis=1), 1)

    # finally: divide elementwise
    A = z / ax_sum
    return A
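A quick, hypothetical usage check of the softmax function above: each row of the output should be a valid probability distribution.

import numpy as np

logits = np.random.randn(4, 6)       # 4 dummy examples, 6 classes
probs = softmax(logits)
print(probs.shape)                   # (4, 6)
print(np.sum(probs, axis=1))         # each row sums to 1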

The softmax function is prone to two issues: overflow and underflow.

Overflow: in case of exploding gradients, the weights can become very large, which makes the computed probabilities useless.

Underflow: in case of vanishing gradients, the weights can be close to zero, and the classes end up sharing almost the same probability.

To combat these issues when doing the softmax computation, a common trick is to shift the input vector by subtracting its maximum element from all elements. For the input vector z, we redefine z as:

z = z - np.expand_dims(np.max(z, axis=1), 1)

The loss function over the m training examples is defined as L = -(1/m) * Σ_i Σ_c y_ic * log(p_ic), where y_ic is the one-hot label and p_ic the softmax prediction for class c of example i:

Figure 1.7 Softmax Loss function
def softmaxloss(x, labels):
    """
    Compute the categorical cross-entropy loss and its gradient.

    :param x: output of the previous layer, of shape (m, 6)
    :param labels: class labels of shape (1, m)
    :return: loss (scalar) and gradient with respect to x, of shape (m, 6)
    """
    one_hot_labels = convert_to_one_hot(labels, 6)
    predictions = softmax(x)
    epsilon = 1e-12
    predictions = np.clip(predictions, epsilon, 1. - epsilon)
    N = predictions.shape[0]
    loss = -np.sum(one_hot_labels * np.log(predictions + 1e-9)) / N
    grad = predictions.copy()
    grad[range(N), labels] -= 1
    grad /= N
    return loss, grad

In the loss function we clip the softmax predictions to the range [epsilon, 1 - epsilon], so that the logarithm never receives a value of exactly 0; this prevents extremely large loss values that could lead to numerical instability.

epsilon = 1e-12
predictions = np.clip(predictions, epsilon, 1. - epsilon)

Backward Propagation

Softmax Loss

Now we propagate our gradients back to the first layer. First we compute the derivative of the cross-entropy loss (with softmax) with respect to the input of the softmax, i.e. the output of the dense layer.

grad = predictions.copy()
grad[range(N), labels] -= 1
grad /= N

Dense Layer

In the Dense layer we receive, as input, the gradient of the cross-entropy loss with respect to the output of the Dense layer. We then compute dW, db and dA_prev. Note that we use the terms loss and cost function interchangeably.

Figure 1.8 Gradient of the cost with respect to the weights.
Figure 1.9 Gradient of the cost with respect to the bias.
Figure 1.10 Gradient of the cost with respect to the input of the Dense layer.
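The figures above are not reproduced here; reconstructed from the code below (with dA the incoming gradient of shape (m, output_dim), A_prev the layer input and W, b its parameters), they correspond to:

dW = A_{prev}^{\top} \, dA, \qquad db = \sum_{i=1}^{m} dA_{i,:}, \qquad dA_{prev} = dA \, W^{\top}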
def backward(self, dA):
    """
    Backward propagation for Dense layer.

    Parameters:
    dA -- gradient of cost with respect to the output of the Dense layer, same shape as Z

    Returns:
    dA_prev -- gradient of cost with respect to the input of the Dense layer, same shape as A_prev
    """
    np.random.seed(self.seed)
    m = self.A_prev.shape[0]
    dW = np.dot(self.A_prev.T, dA)
    db = np.sum(dA, axis=0, keepdims=True)
    dA_prev = np.dot(dA, self.params[0].T)  # same weight matrix W as used in forward
    assert (dA_prev.shape == self.A_prev.shape)
    assert (dW.shape == self.params[0].shape)
    assert (db.shape == self.params[1].shape)

    return dA_prev, [dW, db]

Flatten layer

The flatten layer has no parameters to train, so we do not compute dW and db. To propagate the gradient backward, it simply reshapes dA (of shape (m, 10240)) back to the shape of A_prev, which is (m, 32, 32, 10).

def backward(self, dA):
    """
    Backward propagation of flatten layer.

    Parameters:
    dA -- gradient of cost with respect to the output of the flatten layer, same shape as Z

    Returns:
    dA_prev -- gradient of cost with respect to the input of the flatten layer, same shape as A_prev
    """
    np.random.seed(self.seed)
    dA_prev = dA.reshape(self.A_prev.shape)
    assert (dA_prev.shape == self.A_prev.shape)
    return dA_prev, []

Maxpool Layer

Before doing backpropagation, we create a function which keeps track of where the maximum of the matrix is. True (1) indicates the position of the maximum in the matrix; the other entries are False (0).

Figure 2.1 Mask to find maximum value in the matrix
def create_mask_from_window(self, image_slice):
    """
    Get a mask from an image_slice to identify the max entry.

    Parameters:
    image_slice -- numpy array of shape (f, f, n_C_prev)

    Returns:
    mask -- Array of the same shape as the window, containing True at the position corresponding to the max entry of image_slice.
    """
    mask = np.max(image_slice)
    mask = (image_slice == mask)
    return mask

We keep track of the maximum value in the matrix because this is the input value that ultimately influenced the output, and therefore the cost. Backprop computes gradients with respect to the cost, so anything that influences the final cost should have a non-zero gradient. Back propagation will therefore “propagate” the gradient back to this particular input value that influenced the cost.
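A tiny, hypothetical example of what such a mask looks like (the same idea as Figure 2.1), using plain numpy rather than the class method above:

import numpy as np

window = np.array([[1., 3.],
                   [4., 2.]])
mask = (window == np.max(window))
print(mask)
# [[False False]
#  [ True False]]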

Since max pool has no parameters, we do not compute dW and db.

def backward(self, dA):
    """
    Backward propagation of the pooling layer.

    Parameters:
    dA -- gradient of cost with respect to the output of the pooling layer,
          same shape as Z

    Returns:
    dA_prev -- gradient of cost with respect to the input of the pooling layer,
               same shape as A_prev
    """
    m, n_H_prev, n_W_prev, n_C_prev = self.A_prev.shape
    m, n_H, n_W, n_C = dA.shape
    dA_prev = np.zeros((m, n_H_prev, n_W_prev, n_C_prev))
    for i in range(m):                      # loop over the training examples
        a_prev = self.A_prev[i]
        for h in range(n_H):                # loop over the output height
            for w in range(n_W):            # loop over the output width
                for c in range(n_C):        # loop over the channels
                    vert_start, vert_end, horiz_start, horiz_end = self.get_corners(h, w, self.filter_size, self.stride)
                    a_prev_slice = a_prev[vert_start:vert_end, horiz_start:horiz_end, c]
                    mask = self.create_mask_from_window(a_prev_slice)
                    # route the gradient only to the position of the maximum
                    dA_prev[i, vert_start:vert_end, horiz_start:horiz_end, c] += mask * dA[i, h, w, c]
    assert (dA_prev.shape == self.A_prev.shape)
    return dA_prev, []

Relu Layer

Backward propagation in ReLU is shown in Figure 2.2.

Figure 2.2 Gradient of cost with respect to input of relu.
def backward(self, dA):
    """
    Backward propagation of relu layer.

    f'(x) = 1 if x > 0
            0 otherwise

    Parameters:
    dA -- gradient of cost with respect to the output of the relu layer, same shape as A

    Returns:
    dZ -- gradient of cost with respect to the input of the relu layer, same shape as Z
    """
    Z = self.Z
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0
    assert (dZ.shape == self.Z.shape)
    return dZ, []

Convolution Layer

In the convolution layer we compute three gradients: dA, dW and db.

Figure 2.3 dA with respect to the cost for a certain filter.

In Figure 2.3, W_c is a filter and dZ[h, w] is a scalar corresponding to the gradient of the cost with respect to the output of the convolution layer Z at the h-th row and w-th column (corresponding to the dot product taken at the i-th stride left and j-th stride down). Note that each time, we multiply the same filter W_c by a different dZ when updating dA. We do so mainly because, when computing the forward propagation, each filter is dotted and summed with a different a_slice. Therefore, when computing the backprop for dA, we are just adding the gradients of all the a_slices. The formula in Figure 2.3 translates to the following code in back propagation:

da_prev_pad[vert_start:vert_end, horiz_start:horiz_end, :] += self.params[0][:, :, :, c] * dZ[i, h, w, c]
Figure 2.4 Gradient of one filter with respect to the loss.

a_slice corresponds to the slice which was used to generate the activation Z[i, j]. Hence, this ends up giving us the gradient for W with respect to that slice. Since it is the same W, we simply add up all such gradients to get dW.

The formula in Figure 2.4 translates to the following code in backpropagation:

dW[:, :, :, c] += a_slice_prev * dZ[i, h, w, c]
Figure 2.5 Gradient of bias with respect to the cost for a certain filter.

The formula in Figure 2.5 translates to the following code in backpropagation:

db[:, :, :, c] += dZ[i, h, w, c]
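Figures 2.3 through 2.5 are not reproduced here; written out schematically (with W_c the c-th filter and a_slice^(h,w) the input window that produced Z at position (h, w)), the three accumulation rules above are:

dA_{prev}[\text{window}_{h,w}] \mathrel{+}= W_c \, dZ_{h,w}, \qquad dW_c = \sum_{h,w} a^{(h,w)}_{slice} \, dZ_{h,w}, \qquad db_c = \sum_{h,w} dZ_{h,w}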

Bringing it all together, the backward propagation of convolution layer is given below:

def backward(self, dZ):
    """
    Backward propagation for convolution.

    Parameters:
    dZ -- gradient of the cost with respect to the output of the conv layer (Z), numpy array of shape (m, n_H, n_W, n_C)

    Returns:
    dA_prev -- gradient of the cost with respect to the input of the conv layer (A_prev), numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)
    dW -- gradient of the cost with respect to the weights of the conv layer (W), numpy array of shape (f, f, n_C_prev, n_C)
    db -- gradient of the cost with respect to the biases of the conv layer (b), numpy array of shape (1, 1, 1, n_C)
    """
    np.random.seed(self.seed)
    m, n_H_prev, n_W_prev, n_C_prev = self.A_prev.shape
    f, f, n_C_prev, n_C = self.params[0].shape
    m, n_H, n_W, n_C = dZ.shape
    dA_prev = np.zeros(self.A_prev.shape)
    dW = np.zeros(self.params[0].shape)
    db = np.zeros(self.params[1].shape)
    # Pad A_prev and dA_prev
    A_prev_pad = self.zero_pad(self.A_prev, self.pad)
    dA_prev_pad = self.zero_pad(dA_prev, self.pad)
    for i in range(m):                      # loop over the training examples
        a_prev_pad = A_prev_pad[i, :, :, :]
        da_prev_pad = dA_prev_pad[i, :, :, :]
        for h in range(n_H):                # loop over the output height
            for w in range(n_W):            # loop over the output width
                for c in range(n_C):        # loop over the filters
                    vert_start, vert_end, horiz_start, horiz_end = self.get_corners(h, w, self.filter_size, self.stride)
                    a_slice_prev = a_prev_pad[vert_start:vert_end, horiz_start:horiz_end, :]
                    da_prev_pad[vert_start:vert_end, horiz_start:horiz_end, :] += self.params[0][:, :, :, c] * dZ[i, h, w, c]
                    dW[:, :, :, c] += a_slice_prev * dZ[i, h, w, c]
                    db[:, :, :, c] += dZ[i, h, w, c]
        # remove the padding to recover dA_prev for this example
        dA_prev[i, :, :, :] = da_prev_pad[self.pad:-self.pad, self.pad:-self.pad, :]
    assert (dA_prev.shape == (m, n_H_prev, n_W_prev, n_C_prev))
    return dA_prev, [dW, db]

Gradient checking

Gradient checking is very useful for verifying your back propagation, i.e. that you have computed the gradients correctly. It uses a two-sided difference to numerically approximate the gradients. We randomly select 2 data points from the training data and run the gradient check on them. NOTE: since gradient checking is very slow, do not use it during training.
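Concretely, for every parameter θ_i the check compares the analytic gradient with the two-sided numerical approximation and reports the relative difference, exactly as computed in the code below:

\text{gradapprox}_i = \frac{J(\theta_i + \varepsilon) - J(\theta_i - \varepsilon)}{2\varepsilon}, \qquad \text{difference} = \frac{\lVert \text{grad} - \text{gradapprox} \rVert_2}{\lVert \text{grad} \rVert_2 + \lVert \text{gradapprox} \rVert_2}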

def grad_check():

    train_set_x, train_set_y, test_set_x, test_set_y, n_class = load_data()
    # select randomly 2 data points from training data
    n = 2
    index = np.random.choice(train_set_x.shape[0], n)
    train_set_x = train_set_x[index]
    train_set_y = train_set_y[:, index]
    cnn = make_model(train_set_x, n_class)
    print(cnn.layers)
    A = cnn.forward(train_set_x)
    loss, dA = softmaxloss(A, train_set_y)
    assert (A.shape == dA.shape)
    grads = cnn.backward(dA)
    grads_values = grads_to_vector(grads)
    initial_params = cnn.params
    parameters_values = params_to_vector(initial_params)  # initial parameters
    num_parameters = parameters_values.shape[0]
    J_plus = np.zeros((num_parameters, 1))
    J_minus = np.zeros((num_parameters, 1))
    gradapprox = np.zeros((num_parameters, 1))
    print('number of parameters: ', num_parameters)
    epsilon = 1e-7
    assert (len(grads_values) == len(parameters_values))
    for i in tqdm(range(0, num_parameters)):

        thetaplus = copy.deepcopy(parameters_values)
        thetaplus[i][0] = thetaplus[i][0] + epsilon  # parameters
        new_param = vector_to_param(thetaplus, initial_params)
        difference = compare(new_param, initial_params)
        # make sure only one parameter is changed
        assert (difference == 1)
        cnn.params = new_param
        A = cnn.forward(train_set_x)
        J_plus[i], _ = softmaxloss(A, train_set_y)

        thetaminus = copy.deepcopy(parameters_values)
        thetaminus[i][0] = thetaminus[i][0] - epsilon
        new_param = vector_to_param(thetaminus, initial_params)
        difference = compare(new_param, initial_params)
        # make sure only one parameter is changed
        assert (difference == 1)
        cnn.params = new_param
        A = cnn.forward(train_set_x)
        J_minus[i], _ = softmaxloss(A, train_set_y)

        gradapprox[i] = (J_plus[i] - J_minus[i]) / (2 * epsilon)

    numerator = np.linalg.norm(gradapprox - grads_values)
    denominator = np.linalg.norm(grads_values) + np.linalg.norm(gradapprox)
    difference = numerator / denominator

    if difference > 2e-7:
        print("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
    else:
        print("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")

    return difference
Figure 2.6 Gradient check results.

If your backpropagation works, it will output a message such as the one in Figure 2.6. If there is some mistake in your back propagation, one approach is to compare individual values of the approximate gradients with the original gradients, check where the difference is large, and look at the implementation of those gradients. One more thing to note: we can encounter kinks, which can be a source of inaccuracy and cause the gradient check to fail. Kinks refer to non-differentiable parts of the objective function, introduced by functions such as ReLU (max(0, x)). For instance, consider gradient checking at x = -1e-8. As you may recall from the ReLU backward propagation in Figure 2.2, since x < 0 it computes a zero gradient. However, when computing the two-sided difference, x + epsilon with epsilon = 1e-7 gives 9e-08, which introduces a non-zero gradient.

Adam Optimizer

The Adam optimizer is used as the optimization algorithm to minimize the loss function. It is known to work well on a variety of problems. The hyperparameters of Adam are the learning rate, beta1, beta2 and epsilon. The default choice for beta1 is 0.9 and the default choice for beta2 is 0.999. The choice of epsilon does not matter very much and it is set to 1e-08. Generally, all other hyperparameters are kept at their default values and only the learning rate is tuned. Beta1 is used to compute the exponentially weighted average of the derivatives, which is called the first moment, and beta2 is used to compute the exponentially weighted average of their squares, which is called the second moment. Adam has relatively low memory requirements and usually works well even with little tuning of hyperparameters other than the learning rate [Kingma et al. 2014].
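Written out, the update rule implemented in update_parameters below is, for each weight matrix W (and analogously for each bias b), with α the learning rate and t the Adam counter:

V_{dW} = \beta_1 V_{dW} + (1 - \beta_1)\, dW, \qquad S_{dW} = \beta_2 S_{dW} + (1 - \beta_2)\, dW^{2}

V^{corrected}_{dW} = \frac{V_{dW}}{1 - \beta_1^{t}}, \qquad S^{corrected}_{dW} = \frac{S_{dW}}{1 - \beta_2^{t}}, \qquad W \leftarrow W - \alpha \, \frac{V^{corrected}_{dW}}{\sqrt{S^{corrected}_{dW}} + \epsilon}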

class Adam(object):

    def __init__(self, model, X_train, y_train,
                 learning_rate, epoch, minibatch_size, X_test, y_test):
        self.model = model
        self.X_train = X_train
        self.y_train = y_train
        self.learning_rate = learning_rate
        self.beta1 = 0.9
        self.beta2 = 0.999
        self.epsilon = 1e-08
        self.epoch = epoch
        self.X_test = X_test
        self.y_test = y_test
        self.num_layer = len(self.model.layers)
        self.minibatch_size = minibatch_size

    def initialize_adam(self):
        VdW, Vdb, SdW, Sdb = [], [], [], []
        for param_layer in self.model.params:
            # layers which have no learnable parameters
            if len(param_layer) != 2:
                VdW.append(np.zeros_like([]))
                Vdb.append(np.zeros_like([]))
                SdW.append(np.zeros_like([]))
                Sdb.append(np.zeros_like([]))
            else:
                VdW.append(np.zeros_like(param_layer[0]))
                Vdb.append(np.zeros_like(param_layer[1]))
                SdW.append(np.zeros_like(param_layer[0]))
                Sdb.append(np.zeros_like(param_layer[1]))

        assert len(VdW) == self.num_layer
        assert len(Vdb) == self.num_layer
        assert len(SdW) == self.num_layer
        assert len(Sdb) == self.num_layer

        return VdW, Vdb, SdW, Sdb

    def update_parameters(self, VdW, Vdb, SdW, Sdb, grads, t):

        VdW_corrected = [np.zeros_like(v) for v in VdW]
        Vdb_corrected = [np.zeros_like(v) for v in Vdb]
        SdW_corrected = [np.zeros_like(s) for s in SdW]
        Sdb_corrected = [np.zeros_like(s) for s in Sdb]

        # gradients were collected from the last layer backwards
        grads = list(reversed(grads))
        for i in range(len(grads)):
            # layer which contains weights and biases
            if len(grads[i]) != 0:
                # Moving average of the gradients (Momentum)
                a = self.beta1 * VdW[i]
                b = (1 - self.beta1) * grads[i][0]
                VdW[i] = np.add(a, b)

                a = self.beta1 * Vdb[i]
                b = (1 - self.beta1) * grads[i][1]
                Vdb[i] = np.add(a, b)

                # Moving average of the squared gradients (RMSprop)
                a = self.beta2 * SdW[i]
                b = (1 - self.beta2) * np.power(grads[i][0], 2)
                SdW[i] = np.add(a, b)

                a = self.beta2 * Sdb[i]
                b = (1 - self.beta2) * np.power(grads[i][1], 2)
                Sdb[i] = np.add(a, b)

                # Compute bias-corrected first moment estimate
                den = 1 - (self.beta1 ** t)
                VdW_corrected[i] = np.divide(VdW[i], den)
                Vdb_corrected[i] = np.divide(Vdb[i], den)

                # Compute bias-corrected second raw moment estimate
                den = 1 - (self.beta2 ** t)
                SdW_corrected[i] = np.divide(SdW[i], den)
                Sdb_corrected[i] = np.divide(Sdb[i], den)

                # weight update
                den = np.sqrt(SdW_corrected[i]) + self.epsilon
                self.model.params[i][0] = self.model.params[i][0] - self.learning_rate * np.divide(VdW_corrected[i], den)

                # bias update
                den = np.sqrt(Sdb_corrected[i]) + self.epsilon
                self.model.params[i][1] = self.model.params[i][1] - self.learning_rate * np.divide(Vdb_corrected[i], den)

    def minimize(self):
        costs = []
        t = 0
        np.random.seed(1)
        VdW, Vdb, SdW, Sdb = self.initialize_adam()
        for i in tqdm(range(self.epoch)):
            start = time.time()
            epoch_loss = 0
            minibatches = get_minibatches(self.X_train,
                                          self.y_train,
                                          self.minibatch_size)
            for minibatch in tqdm(minibatches):
                # Select a minibatch
                (minibatch_X, minibatch_Y) = minibatch
                # forward and backward propagation
                loss, grads = self.model.fit(minibatch_X, minibatch_Y)
                epoch_loss += loss
                t = t + 1  # Adam counter
                # weight update
                self.update_parameters(VdW, Vdb, SdW, Sdb, grads, t)

            # Print the cost every epoch
            end = time.time()
            epoch_time = end - start
            train_acc = accuracy(self.model.predict(self.X_train),
                                 self.y_train)
            val_acc = accuracy(self.model.predict(self.X_test),
                               self.y_test)
            print("Cost after epoch %i: %f" % (i, epoch_loss),
                  'time (s):', epoch_time,
                  'train_acc:', train_acc,
                  'val_acc:', val_acc)
            costs.append(epoch_loss)
        print('total_cost', costs)

        return self.model, costs

Now we have reached the end of the article, and hopefully you have followed along. If you liked it, don’t forget to give it a thumbs up 🙂

You can add me on LinkedIn: https://www.linkedin.com/in/mustufain-abbas/

The code for the article can be found at: https://github.com/Mustufain/Convolution-Neural-Network-

References

Andrew Ng’s course on Coursera: https://www.coursera.org/learn/convolutional-neural-networks-tensorflow

He, Kaiming, et al. “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification.” Proceedings of the IEEE International Conference on Computer Vision. 2015.

Kingma, Diederik P., and Jimmy Ba. “Adam: A method for stochastic optimization.” arXiv preprint arXiv:1412.6980(2014).

Zeiler, Matthew D., and Rob Fergus. “Visualizing and understanding convolutional networks.” European conference on computer vision. Springer, Cham, 2014.